Series Editor's Introduction


Professor Krippendorff has written an excellent introduction to infor- 
mation theory, particularly to its application for structural modeling. 
He provides a lucid discussion of essential topics, such as how to confirm 
an information theory model, its use in exploratory research, and how it
compares with alternative approaches such as network analysis, path 
analysis, chi-square, and analysis of variance. This places information 
theory into a framework that most social scientists can readily compre- 
hend and evaluate. Professor Krippendorff's thorough understanding 
of the theory and use of information theory also takes the careful reader
a long way toward competency. 

This book is particularly successful at making a rather complicated 
system for analyzing multivariate qualitative data as simple as possible. 
Krippendorff does this by building the entire presentation around intui- 
tively appealing notions of information, such as the amount of infor- 
mation provided by an answer to a question, the amount transmitted 
through a noisy channel, and so on, rather than by using the axioms and 
theorems of information theory. He also makes copious use of illus- 
trations designed to simplify and clarify the complex issues of structural 
modeling. 

Although this book is an introduction to a well-known but as yet 
underutilized topic, it does more than merely summarize current knowl- 
edge or present basic concepts. It presents new developments, including 
extensions of classical information theory to many variables, to circular 
causal processes and to complex models of qualitative data, the use of 
information theory as an analytical tool, the algebra of information in 
many variables, and a description of the algorithms needed for com- 
puter implementations. Much of Krippendorff's presentation is original 
and promises wide applications. It should serve equally well as a text- 
book and as a source book for social scientists and social researchers 
who are interested in communication explanations and information 
theory. We recommend it highly to researchers in communication 
theory, information science, and systems theory, and suggest that it be 
studied carefully by social scientists interested in structural modeling,
particularly in sociology, political science, and psychology. Professor 
Krippendorff has written his manuscript with these audiences in mind 
and has succeeded admirably. 


—John L. Sullivan 
Series Co-Editor 


FOREWORD 


The desire to take a fresh look at the topic of this volume is motivated 
by both a renewed interest in qualitative data and recent developments 
in information theory. Information theory is not merely a convenient 
statistical tool; it has an additional appeal to social scientists because it 
provides explanatory structures, theorems of considerable generality, 
and a powerful calculus for quantities of entropy, information, and 
communication—all of which are at the root of many social phe- 
nomena. The ideas and terminology developed in the first five chapters 
reflect this dual purpose by providing concepts that are both basic to 
social theory and introductory to the analytical machinery that follows. 

The book's main thesis derives from the extension of the original 
Mathematical Theory of Communication (Shannon and Weaver, 1949) 
to multiple variables (McGill, 1954; Kullback, 1959; Ashby, 1965, 1969) 
and to complex structures (Klir, 1976). In particular, this volume treats 
circular causal or simultaneous dependencies (Krippendorff, 1981) that 
escaped analysis by most established techniques, as well as penetration 
by traditional social science theories. The availability of electronic 
computers played an important role in forging these developments by 
relieving researchers of routine calculations and allowing them to adopt 
more powerful conceptualizations governing data analysis and explo- 
ration. Finally, multivariate information theory has acquired additional 
foundations in the work by theoretical statisticians, especially by 
Kullback, Mosteller, Goodman, Fienberg, Bishop, and others who 
linked these notions to the ongoing revolution in contingency table 
analysis, variance analysis, log-linear modeling, and Markov processes 
in particular. 

Although there are many modern facets of information theory, this 
book presents only what is needed to search for and test structural 
models of qualitative data; that is, models that exhibit complex relations 
among their component parts and rely on these relations to interpret 
given data. Communication or information transmission is just one 
attractive interpretation of such relationships. In this respect the book's 
aim is primarily practical, providing tools rather than theorems. It starts
with deliberate slowness. Chapters progressively build upon each other. 
The apparently independent Chapters 6 through 12 are tied together in 
Chapters 13 through 15. Readers can develop their own sense of closure 
even before getting to the end of the volume. 

Information Theory is written for advanced undergraduate and 
graduate students. It can be used as a text in qualitative data analysis 
and multivariate techniques courses. It should also be of interest to 
experienced social scientists who can afford to read more selectively. 
Communication researchers, information scientists, and systems the- 
orists might find these structural models particularly close to their 
theoretical concerns. 

I am grateful for many valuable comments and suggestions received
from colleagues, especially from Roger C. Conant, Alexander von Eye, 
Seth Finn, an anonymous reviewer, and from students at The Annenberg 
School of Communications, University of Pennsylvania, who used
earlier drafts as a text. Finally, the work is unthinkable without W. Ross
Ashby’s early influence.


INFORMATION THEORY 
Structural Models for 
Qualitative Data 


KLAUS KRIPPENDORFF 
University of Pennsylvania 


1. QUALITATIVE DATA 


Qualitative data arise from distinctions drawn within a sample of obser- 
vations. The act of drawing distinctions makes the observations distin- 
guished of a different kind. No quantitative (ordinal or magnitudinal) 
differences are implied. Figure 1 depicts distinctions drawn by the New 
York Times Magazine (March 11, 1979) within 4,764 murder cases tried 
in Florida between 1973 and 1979. Each dot represents a case, and one 
may notice the conspicuous absence of death sentences for white 
convicts when the victim is black. Observations may be people, as in 
Figure 1, but they may also be messages, events, things, processes, 
anything individually describable, and any motivation to find differ- 
ences between them qualifies as a basis of drawing distinctions. There 
may be numerous distinctions. For example, individuals might be classi- 
fied by occupation, place of birth, sex, party affiliation, marital status, 
religion, criminal record, psychopathology, telephone number, types of 
people friendly with, personality type, languages spoken, magazines 
subscribed to, messages received, media use habits, type of car owned, 
reasons for visiting a physician, or drugs used—but none of them 
implies a continuum, and all are logically independent of each other. 

Qualitative data are also called nominal because the descriptions 
used are like names, making observations merely same or different 
without recognizing degrees (e.g., the use of social security numbers); 
categorical because observations are considered by their kind, category, 
or class to which they belong; discrete because the boundaries make and
mark a discontinuity in space; and freely permutable because the 
arrangement of observations is arbitrary and conveys no information. 
The latter is also connoted by the label unordered data. 
[Figure 1: dot display of the murder cases, one dot per case, distinguished by race of victim, race of murderer (white or black), and outcome (death or other penalty)]


Although many social science data are of this kind, the social sciences 
are not limited to qualitative data—consider such variables as age, 
income, time spent watching TV, and intelligence quotient, which are 
called quantitative because there can be more or less of each. Nor are
qualitative data excluded from other fields of inquiry—consider the 
distinction between gaseous, fluid, and solid states of matter in physics;
positions of a relay or the states of a wh[...] electrical engineering;
[...] structures in biology; and market compositions in economics, all of
which are differentiate[...] sometimes considered [...] them probably the
m[...] found particularly wh[...] interest or when a culture or its social
institutions prescribe in which [...] products, or procedures are to be [...]

Figure 3


distinctions regard as of the same kind. Incidentally, the term count data 
reflects this somewhat less important convenience of representing by 
their number the observations that a classification scheme no longer 
distinguishes. Spatial representations can aid the recognition of patterns
in qualitative data (see Figure 4 as an example of where the above
may lead to) but become easily incomprehensible when more than three 
dimensions are involved. Cross-tabulations list the frequencies or prob- 
abilities of observations in tabular or matrix form, either breaking a 
multivariate space into separate subspaces or combining dimensions to 
obtain comprehensive tables. The latter is exemplified in Figure 5 by 
four-dimensional panel data, provided to Paul Lazarsfeld by NBC,
showing exposure to violent TV programming, E, and aggressive
behavior, A, both recorded at two different points in time (Lazarsfeld,
1974).

V = Victim, A = Aggressor, O = Outcome

                                   Death    Other Penalty
Black Victim, Black Murderer          11        2209
White Victim, Black Murderer          48         239
Black Victim, White Murderer           0         111
White Victim, White Murderer          72        2074

Figure 4

[Figure 5: cross-tabulation of Exposure to Violent Programming and Displayed Aggression at Time 1 against the same two variables at Time 2; cell values not recoverable]

Our definition of qualit[...] tinctions are vacuous (i.e., they do not distin[...]
a product of two variables, usually conceived of as a matrix whose rows
and columns are labeled $a \in A$ and $b \in B$, respectively, and whose cells are
designated by $ab \in AB$. By extension, cross-tabulations or multivariate
spaces of an indefinite but finite dimensionality will be referred to by
ABC...Z and its cells by $abc...z \in ABC...Z$. In these terms Figure 5 is an
EA × E'A' matrix of 16 eae'a' cells.
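The notation can be made concrete with a small sketch. It assumes, purely for illustration, that observations arrive as pairs of category labels; the labels, the sample, and the function names `cross_tabulate` and `margin` are hypothetical and not part of the text.

```python
from collections import Counter

# Hypothetical observations, each described by a pair (a, b) of category labels.
observations = [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a2", "b2"), ("a1", "b2")]

def cross_tabulate(pairs):
    """Return the frequency of every occupied cell ab of the product AB."""
    return Counter(pairs)

def margin(table, position):
    """Sum cell frequencies over the other variable (position 0 = A, 1 = B)."""
    totals = Counter()
    for cell, frequency in table.items():
        totals[cell[position]] += frequency
    return totals

table = cross_tabulate(observations)
row_sums = margin(table, 0)      # frequencies of the categories a in A
column_sums = margin(table, 1)   # frequencies of the categories b in B
```

Cells that never occur are simply absent from the counter and read as zero, which is convenient because unoccupied cells carry no frequency.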


2. SELECTIVE INFORMATION 


Information is the key to my approach. Although we subsequently 
will revise this concept to meet the requirements of structural models of 
qualitative data, to begin with I define information as a measure of the 
amount of selective work a message enables its receiver to do. 

Accordingly, asking a yes-or-no question admits an initial uncer- 
tainty about what the correct answer might be, and the answer to such a 
question informs the questioner in the sense of “selecting” one of the two 
options he or she had in mind, thus removing the initial uncertainty. The 
answer to a yes-or-no question is taken to convey one bit of information 
which constitutes our basic unit of measurement. To capture this intu- 
ition, uncertainty U is defined by the dual or base-2 logarithm of the 
number N of options available. With reference to some variable A, 


$$U(A) = \log_2 N_A \qquad [2.1]$$


Accordingly, our yes-or-no question implies $N_A = 2$ logical alternatives
and represents $\log_2 2 = 1$ bit of uncertainty.

Given that N refers to logical possibilities, each being of equal weight, 
U may be said to measure the logical variety in a descriptive system 
of categories. The attribute "logical" is important, as my definition of 
information does not yet extend to data, frequencies, or observational 
probabilities. Figure 6 lists some integer values. 

The amount of information a message—say, a—of the set of possible 
messages A conveys then becomes the difference between two states of 
uncertainty, the uncertainty U(A) before or without knowledge of that 
message and the uncertainty U(a) after or with knowledge of that 
message: 


$$I(a \in A) = U(A) - U(a) = \log_2 N_A - \log_2 N_a \qquad [2.2]$$


Number of Logical                 Probability
Options               Bits        of Options
      1                 0         1.
      2                 1         .5
      4                 2         .25
      8                 3         .125
     16                 4         .0625
     32                 5         .03125
     64                 6         .015625
    128                 7         .0078125
    256                 8         .00390625
    512                 9         .001953125
   1024                10         .0009765625
      N             log_2 N       1/N

Figure 6

So if a decision maker must pick one of $N_A = 8$ alternative courses of
action and is given a report that shows that six of them lead to certain
failure, there remain $N_a = 8 - 6 = 2$ options to choose from, making the
report worth

$$I(\text{Report}) = U(A) - U(a) = \log_2 8 - \log_2 2 = 2 \text{ bits}$$

of information, which is equivalent to receiving the answers to two 
yes-or-no questions. To remove the remaining uncertainty, the decision 
maker will have to gather one more bit of information or risk a 50% 
chance of failure. This risk is, of course, considerably less than the risk 
that existed before receiving the report. 
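The report example can be checked with a few lines of code. This is a minimal sketch of definitions 2.1 and 2.2; the function names `uncertainty` and `selective_information` are illustrative choices, not terms from the text.

```python
import math

def uncertainty(n_options: int) -> float:
    """U = log2 N, the uncertainty among N equally weighted options (2.1)."""
    return math.log2(n_options)

def selective_information(n_before: int, n_after: int) -> float:
    """I = U(A) - U(a) = log2 N_A - log2 N_a (2.2)."""
    return uncertainty(n_before) - uncertainty(n_after)

report = selective_information(8, 8 - 6)   # six of eight options ruled out
remaining = uncertainty(2)                  # what is still to be resolved
print(report, remaining)  # 2.0 1.0
```

The second value is the one bit the decision maker would still have to gather to remove the remaining uncertainty.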

The connection between uncertainty and the risk involved in making 
wrong decisions leads to an expression of information as a function of 
the probability of selecting the desired set of alternatives by chance: 


$$I(a \in A) = \log_2 N_A - \log_2 N_a = -\log_2 \frac{N_a}{N_A} = -\log_2 P_a \qquad [2.3]$$
[...]plicative (see “log” in the List of Symbols). So when $P_a$ is the probability
of guessing the answer to question A and $P_b$ is the probability of guessing
the answer to an independent question B, the probability of guessing
both correctly will be $P_{ab} = P_a P_b$.


$$\begin{aligned}
I(ab \in AB) &= -\log_2 P_{ab} \qquad [2.4] \\
&= -\log_2 P_a P_b \\
&= -\log_2 P_a - \log_2 P_b \\
&= I(a \in A) + I(b \in B)
\end{aligned}$$


It is this property that assures the additivity of information quantities— 
for example, that two floppy disks can contain twice as much informa- 
tion as one. For further explanations of these ideas, see Krippendorff 
(1975). 
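A short numerical check of the additivity in 2.4, under the assumption of two independent questions with made-up guessing probabilities of 1/8 and 1/4:

```python
import math

def information(p: float) -> float:
    """I = -log2 P, the information of an answer guessed with probability P (2.3)."""
    return -math.log2(p)

p_a = 1 / 8   # hypothetical probability of guessing the answer to A
p_b = 1 / 4   # hypothetical probability of guessing the answer to B

joint = information(p_a * p_b)                  # information of both answers together
separate = information(p_a) + information(p_b)  # sum of the two separate amounts
```

Both quantities come to 5 bits: the probabilities multiply while the information quantities add.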


3. ENTROPY, DIVERSITY, 
VARIETY 


Entropy is a measure of observational variety or of actual (as opposed 
to logically possible) diversity. Unlike the measure of selective informa- 
tion, entropy takes into account that messages or categories of events 
may occur with unequal frequencies or probabilities. The two measures 
are related, however, and one will be distinguished from the other. 

Reconsider the data in Figure 1. When all n = 4,764 murder cases are 
considered unique (as they no doubt are from the perspective of the 
individuals involved), the total amount of uncertainty as to which case 
we are referring to is $\log_2 4764 = 12.218$ bits. After drawing distinctions
suitable to an analysis and thereby putting several observations into one 
category on grounds that in some crucial respect they are the same, 
as seen in Figure 4, some uncertainty will be lost. The uncertainty lost by 
lumping 111 cases into one category is $\log_2 111$; in another it is
$\log_2 2074$ ...; and in the generic a-th category of the variable A it is
$U(a) = \log_2 n_a$. On average, this uncertainty is 10.660 bits. A reasonable measure
of the uncertainty remaining after such a multiple classification is the 
difference between the uncertainty in the sample before any classifica- 
tion and the average uncertainty such a classification loses. The resulting 
measure is called entropy. Without loss of generality, it is stated here 
for one variable (even so, our example could be seen as involving three): 
$$H(A) = \log_2 n - \sum_{a \in A} \frac{n_a}{n} \log_2 n_a \qquad [3.1]$$


Whereas the selective information in 2.3 quantifies a simple reduction 
in logical possibilities, the entropy in 3.1 quantifies a reduction of n 
distinct observations to fewer categories. In our example, the entropy is 
12.218 - 10.660 = 1.558 bits. By rearranging the parts of 3.1, the entropy
can be seen to be the average amount of information required to select
(predict or identify) observations by categories: 


$$H(A) = -\sum_{a \in A} \frac{n_a}{n} \log_2 \frac{n_a}{n} \qquad [3.2]$$


By replacing the relative frequency $n_a/n$ with its limiting case, the
probability $p_a$, we obtain:


$$H(A) = -\sum_{a \in A} p_a \log_2 p_a \qquad [3.3]$$


which is the most familiar definition of entropy and was introduced in 
this form by Shannon and Weaver (1949). (Although there are occasions 
on which the relative frequencies in a sample deviate from the probabil- 
ities in the population from which that sample was drawn, we will be 
concerned with this difference only when testing the significance of 
information quantities and will use relative frequencies and probabilities
interchangeably otherwise.) In the one-variable “outcomes of
murder trials” there are a subtotal of 131 death convictions that are
much more difficult to guess ($-\log_2 p_{\text{death penalty}} = 5.184$ bits) than the 4,633
other outcomes ($-\log_2 p_{\text{other}} = .0402$ bits). Considering the different
weights imposed by their rather unequal frequency of occurrence,
$-p_{\text{death penalty}} \log_2 p_{\text{death penalty}} = .1426$ and
$-p_{\text{other}} \log_2 p_{\text{other}} = .0391$, the entropy
in this distinction sums to .1817 bits, quantitatively reflecting a rather
predictable outcome.
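The two forms of the entropy, 3.1 and 3.2, can be compared numerically on the outcome frequencies just discussed. The sketch below is one possible rendering; the function names are illustrative, and empty categories are skipped on the convention that they contribute nothing.

```python
import math

def entropy_from_counts(counts):
    """H = -sum (n_a/n) log2 (n_a/n), skipping empty categories (3.2)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def entropy_as_uncertainty_lost(counts):
    """H = log2 n - sum (n_a/n) log2 n_a, the form of 3.1."""
    n = sum(counts)
    return math.log2(n) - sum((c / n) * math.log2(c) for c in counts if c > 0)

outcomes = [131, 4633]   # death sentences, other penalties
h_outcomes = entropy_from_counts(outcomes)
h_check = entropy_as_uncertainty_lost(outcomes)
print(round(h_outcomes, 4))
```

Both functions agree up to floating-point rounding, and the result rounds to the .1817 bits reported above.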

Part of the definition of entropy, and one reason for calling it a
[...] is that unobserved possibilities do not [...]
$$0 \le H(A) \le H(A)_{\max} = \log_2(\min[N_A, n]) \qquad [3.4]$$


where $N_A$ is the number of categories in variable A and n is the sample
size. The entropy is zero when observational variety is absent, that is,
when distinctions are vacuous and all observations are of the same kind,
$p_a = 1$ for one category and zero for all others, in which case both $1 \log_2 1 = 0$
and by convention $0 \log_2 0 = 0$. The entropy is maximum when the $N_A$
cells are occupied either by the same number of observations, $n_a = n/N_A$,
in which case frequencies and probabilities are uniformly distributed, or
by $n_a = 1$ at most, in which case all observations are unique. Thus the
amount of uncertainty (2.1) is a limiting condition for quantities of
entropy. In the example, the victims' race, with its almost uniform
distribution of 51% and 49%, measures an entropy of .9997 bits and is
near its maximum of 1 bit.
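The limits in 3.4 can be illustrated with a brief sketch. The function names are illustrative, and the 51/49 split is taken from the rounded percentages in the text rather than from the raw frequencies.

```python
import math

def entropy(weights):
    """H = -sum p log2 p, with 0 log2 0 taken as 0 by convention (3.3)."""
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

def max_entropy(n_categories: int, sample_size: int) -> float:
    """H_max = log2(min[N_A, n]) (3.4)."""
    return math.log2(min(n_categories, sample_size))

vacuous = entropy([10, 0])        # all observations of one kind
uniform = entropy([5, 5])         # observations spread evenly over two categories
victims_race = entropy([51, 49])  # the nearly uniform split quoted above
ceiling = max_entropy(2, 4764)    # two categories, many observations
```

Here `vacuous` is zero, `uniform` reaches the ceiling of 1 bit, and `victims_race` falls just short of it.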

Entropies do not respond to the nature of the categories involved. 
Their labels are freely permutable. Only the set of frequencies or probabilities
matters. It is in this sense that entropies are said to be content-free.
What is true for the arrangement of values in a variable is also true
for the arrangement of cells in a matrix or space of greater dimension- 
ality. The entropy 


$$H(ABC...Z) = -\sum_{a \in A} \sum_{b \in B} \cdots \sum_{z \in Z} p_{abc...z} \log_2 p_{abc...z} \qquad [3.5]$$


which is an obvious generalization of 3.3, reflects more and finer dis- 
tinctions than those drawn by any single variable, but it too mea- 
sures nothing other than a dimensionless collection of frequencies or 
probabilities. 

Unlike variance measures of deviations from a mean which assume 
normal distributions, entropies assume nothing about the nature of the 
frequency or probability distribution they assess and are thus nonparametric
measures of variety and entirely general in this respect.

Entropies are averages. Figure 7 illustrates how one might interpret 
them as an average number of binary decisions made in the course of 
classification. This partition of 32 events measures 1.875 bits. The 
decision tree recursively divides these events into equal parts, each 
amounting to 1 bit. However, after the first distinction, the second is 
made in only half of the cases and thus contributes only .5 bits to the 
measure. The third distinction is made in only a quarter of the cases and 
contributes .25 bits, and so on. These four contributions add to the total 
of 1.875 bits, QED. It follows that 1 bit of entropy may reflect not only
two equally likely alternatives but could also arise from decisions among
more than two rather unlikely cases. Figure 8 exemplifies several distributions,
all of which measure approximately 1 bit.

[Figure 7: decision tree dividing n = 32 events by successive binary distinctions, each worth 1 bit and weighted by the share of cases in which it is made, totaling 1.875 bits]

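The decision-tree reading of the 1.875 bits can be reproduced numerically. The category sizes below (16, 8, 4, 2, 2) are an assumption: they are one partition of 32 events that fits the verbal description of repeated halving, since the figure itself is not reproduced here.

```python
import math

def entropy(counts):
    """H = -sum (n_a/n) log2 (n_a/n) over non-empty categories (3.2)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

partition = [16, 8, 4, 2, 2]   # assumed category sizes for the 32 events

# Each successive binary distinction is worth 1 bit but is made only for
# the share of events that is still undivided at that point.
share_reaching_distinction = [1, 1 / 2, 1 / 4, 1 / 8]
tree_total = sum(share_reaching_distinction)

h_partition = entropy(partition)
```

Under this assumption the entropy formula and the sum of the four weighted decisions give the same 1.875 bits.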
Entropies are a function of relative magnitudes, probabilities being 
the most common form. The sample size does not influence the entropy 
values (except in the form of a statistical bias, which will be considered
later). Entropies may be standardized or expressed relative to their
maximum: 


$$\frac{H(A)}{H(A)_{\max}} \qquad [3.6]$$


Standardized or relative entropies no longer express the magnitude of 
diversity or variety (analogous to variance) but may be interpreted as an 
index of uniformity (analogous to the standard deviation). Entropy 
measures provide access to a rich source of data for the construction of 
theories in which variety, diversity, and differentiation are the target of 
generalizations. For example, early studies in psychology of absolute
judgments led to generalizations of human information-processing
limits over several sensory domains (Miller, 1956; Attneave, 1959). The
entropy of prose (Shannon and Weaver, 1949; Weltner, 1973) has been
[...]r, 1953, 1956), with English profi[...]
with reader enjoyment (Finn, 1985).
Similar intentions led to applications of entropy measures to art and
aesthetics (Bense, 1956; [...] 1959; Moles, 1966; Berlyne, 1971),
newspapers (Schramm, 1955), television programming (Watt and Krull, 
1974; Watt and Welch, 1983), and to the instrumental and functional 
complexity of cultural objects (Moles, 1960). 

To test whether the press fulfills its promise of keeping the public
informed, Chaffee and Wilson (1977) used entropies to measure the 
diversity of public opinion in media-poor and media-rich environments. 
Danowski (1974) and Danowski and Ruchinskas (1979) correlated the
entropy of media exposure with aging and with the complexity of
interpersonal networks. Entropy is also the primary target of what has
become known as the convergence model of communication (Rogers 
and Kincaid, 1981). It suggests that communication processes change 
the distributions of beliefs, values, and behaviors within a population 
and reverse the natural tendency toward increasing entropy. 

Entropy is also the target of many processes of social control. For 
example, political succession can be seen as reducing the great variety of 
aspirants to a political office until the last uncertainty is removed by 
ballot. In a pedagogical example, Lachman, Lachman, and Butterfield 
(1979) cite an entropy of nearly 4 bits from November 1975 data on the 
probable success of some 17 candidates for the U.S. presidency. The 
preelection process reduced this entropy to nearly 1 bit (characterizing
the nearly equally likely success of Carter and Ford before election day 
on which voters removed the remaining uncertainty). 

Finally, entropy is also the key to a fundamental law in the cyber- 
netics of regulation. Ashby's (1956) law of requisite variety, which states 
that "only variety can destroy variety," implies that the survival of a
system depends on its ability to generate at least as much variety within 
its boundaries as exists in the form of threatening disturbances from its 
environment. In light of such a fundamental condition, many entropy 
measures gain social importance. For example, Theil (1972) reported 
studies measuring occupational diversity in cities, industrial concentra- 
tion in the United States, and the entropies of employment, markets, 
income, and political representation, all of which can be linked to the 
growth and decline of social systems. Montroll (1983) applied the entropy 
function to Sears catalogues and showed that the company's success 
depended on keeping variations in the entropy of prices nearly constant. 
Galtung (1975) related entropy to a general theory of peace. 

Subsequent chapters take the wide applicability of entropy measures 
for granted, avoid using them as measures or indices in their own right, 
and focus instead on the analytical opportunities they offer, Shannon's
theory of communication being an early recognition of these. 


4. SHANNON'S THEORY OF COMMUNICATION 


Shannon's widely publicized model of communication (Shannon and 
Weaver, 1949) is a chain of processes as shown in Figure 9. The model is 
of considerable generality. The labels on its boxes do not matter but 
merely exemplify one interpretation. The model is applicable not only to 
mediated communication (e.g., telephone, newsprint, computers) but 
also to the flow of orders through a chain of command, to the sequential 
analysis of data in the course of a scientific experiment, or to infor- 
mation processing within an organism. What is important is that each of 
these boxes is described by a transformation, with variable inputs, 
variable outputs, and transition probabilities connecting the two. A 
transformation so described is also called a code. Shannon’s theory 
keeps track of information flows through such coding processes, quanti- 
fies channel capacities, redundancies, and errors, and offers various 
theorems relating them. 

To understand any one of Shannon’s boxes in this chain, we must
identify two sets of categories, whether they be signals, messages,
judgments, courses of action, types of contents, or patterns in the input
or output. We must then ascertain the connections between them.
Figure 10 gives four such examples sagittally and tabularly. In a Perfect
Channel, the messages sent and the messages received correspond one to
one. They do not need to be the same (as in translations or in sound
recordings) as long as they are not irrecoverably mixed. In a perfect
channel of communication, encoding and decoding are inverses of
each other. Imperfect channels entail two kinds of errors: noise and
equivocation.


Noise and Equivocation 


[Figure 9: Shannon's chain of communication processes; surviving labels: Noise, Equivocation]

[Figure 10: four channels (Perfect Channel, Noise Only, Equivocation Only, Mixed), each drawn sagittally and as a table of Sender A by Receiver B]

Noise occurs when a sender cannot be certain about how the message
is received. In Figure 10 this is indicated by branching arrows or by two
or more entries per row. The term is borrowed from acoustical experiences
in telephone communication that make hearing difficult and is
generalized here to refer to all unexplainable variation, including the 
static on a TV screen and incomprehensible rhetoric. Noise need not be 
undesirable as in creative pursuits or in political discourse, in which 
ambiguity may be intentional. Noise simply measures the input-unre- 
lated variety in an output stream. 

Equivocation occurs when the receiver is unable to differentiate 
between two or more messages sent. In Figure 10 this is indicated by 
converging arrows or by two or more entries per column. Equivocation 
can be taken to mean “regarded as equal” and occurs not only when a 
receiver cannot make out which message was intended but in all efforts 
to classify, abbreviate, simplify, or abstract. Equivocation measures the 
variety removed from the input stream. 

The mathematical machinery for analyzing such situations requires 
data on either (a) the probabilities or frequencies with which messages 
are sent plus the transition probabilities of how each sent message is 
received, (b) the probabilities or frequencies with which messages are 
received plus the (inverse transition) probabilities of how each received 
message was sent, or, finally, (c) the probabilities or frequencies of all 
transitions between or cooccurrences of values from two or more vari- 
ables. Figure 11 defines the relevant probabilities and frequencies in 
terms of (c). (For notational simplicity we are now taking for granted a's 
membership in A, b's membership in B, and so on and drop references to
relative frequencies in preference to probabilities where practical.) 

According to 3.2 and 3.3, the sender's entropy in the vertical margin
(row sums) in Figure 11 is

$$H(A) = -\sum_a p_a \log_2 p_a = -\sum_a \frac{n_a}{n} \log_2 \frac{n_a}{n}$$

The receiver's entropy in the horizontal margin (column sums) is

$$H(B) = -\sum_b p_b \log_2 p_b = -\sum_b \frac{n_b}{n} \log_2 \frac{n_b}{n}$$

and the total entropy in the table of cooccurrences is

$$H(AB) = -\sum_a \sum_b p_{ab} \log_2 p_{ab} = -\sum_a \sum_b \frac{n_{ab}}{n} \log_2 \frac{n_{ab}}{n}$$

Without reference to marginal entropies, H(AB)’s absolute limits are as
in 3.4, and with reference to marginal entropies, H(AB)’s relative limits
are

$$\max[H(A), H(B)] \le H(AB) \le H(A) + H(B) \qquad [4.1]$$

In 4.1 the minimum entropy represents the case in which either the rows
or the columns of the transition matrix have no more than one non-zero
entry each, the corresponding [...] the relation manifest in this [...]
sum of the two marginal entropies [...] property in 2.4. Now
4.1 implies that the probabilities $\pi_{ab} = p_a p_b$ of this maximum entropy
distribution need not be computed explicitly.
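The limits in 4.1 can be probed with small tables of cooccurrence frequencies. The three tables below are invented for illustration (senders as rows, receivers as columns), and the helper names are not from the text.

```python
import math

def entropy(frequencies):
    """H = -sum p log2 p over non-empty cells or categories (3.3)."""
    n = sum(frequencies)
    return -sum((f / n) * math.log2(f / n) for f in frequencies if f > 0)

def entropies(table):
    """Return H(A), H(B), H(AB) for a table with rows a and columns b."""
    row_sums = [sum(row) for row in table]
    column_sums = [sum(column) for column in zip(*table)]
    cells = [cell for row in table for cell in row]
    return entropy(row_sums), entropy(column_sums), entropy(cells)

# Hypothetical tables.
perfect = [[10, 0], [0, 30]]         # one non-zero entry per row and column
independent = [[10, 30], [20, 60]]   # every cell equals row sum * column sum / n
mixed = [[30, 10], [5, 55]]

def within_limits(table, tolerance=1e-12):
    """Check max[H(A), H(B)] <= H(AB) <= H(A) + H(B)."""
    h_a, h_b, h_ab = entropies(table)
    return max(h_a, h_b) - tolerance <= h_ab <= h_a + h_b + tolerance
```

For `perfect` the joint entropy sits at the lower limit, for `independent` at the upper limit, and for `mixed` strictly in between.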


[Figure 11: tables of probabilities and of frequencies, with the sender's states $a \in A$ as rows and the receiver's states $b \in B$ as columns]
The amount of noise is variously defined by

$$\begin{aligned}
H_A(B) &= H(AB) - H(A) \qquad [4.2] \\
&= \sum_a p_a H_a(B) \\
&= -\sum_a p_a \sum_b p_{b|a} \log_2 p_{b|a}
\end{aligned}$$

that is, as the algebraic difference between the joint entropy H(AB) and
the sender's entropy H(A), as the average of the entropies $H_a(B)$ in the
rows of Figure 11, the latter being expressed either in terms of conditional
probabilities $p_{b|a}$ or relative frequencies $n_{ab}/n_a$. If each message
were received as one and only one message, then transition probabilities
$p_{b|a} = 0$ or 1, all row entropies $H_a(B) = 0$, $H(AB) = H(A)$, and noise
is absent. Positive quantities of noise therefore indicate the confusion
the known message a causes in the receiver B. The amount of noise is
limited by

$$\max[0, H(B) - H(A)] \le H_A(B) \le H(B) \le H(AB) \qquad [4.3]$$


The amount of equivocation follows the same logic except that the 
positions of sender and receiver and, consequently, the references to 
rows and columns are reversed. If $H_A(B)$ measures the noise in a communication
channel, $H_B(A)$ becomes its equivocation.
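A sketch of how noise and equivocation might be computed from a table of cooccurrence frequencies follows. The "noise only" table is invented for illustration, and the function names are assumptions rather than terms from the text.

```python
import math

def entropy(frequencies):
    """H = -sum p log2 p over non-empty cells or categories (3.3)."""
    n = sum(frequencies)
    return -sum((f / n) * math.log2(f / n) for f in frequencies if f > 0)

def noise_and_equivocation(table):
    """Return (H_A(B), H_B(A)) as H(AB) - H(A) and H(AB) - H(B)."""
    h_a = entropy([sum(row) for row in table])
    h_b = entropy([sum(column) for column in zip(*table)])
    h_ab = entropy([cell for row in table for cell in row])
    return h_ab - h_a, h_ab - h_b

def average_row_entropy(table):
    """Noise as the average of the row entropies H_a(B), weighted by p_a (4.2)."""
    n = sum(sum(row) for row in table)
    return sum((sum(row) / n) * entropy(row) for row in table if sum(row) > 0)

# Hypothetical "noise only" channel: the first message branches into two
# received forms, while no two messages converge on the same column.
noise_only = [[10, 10, 0],
              [0, 0, 20]]
noise, equivocation = noise_and_equivocation(noise_only)
```

For this table the noise is half a bit by either route of 4.2, and the equivocation is zero because every column has a single entry.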


Information Transmitted 


The amount of information transmitted through a channel can also 
be expressed in several conceptually different but mathematically equivalent
ways. As the difference between the maximum entropy and the
observed entropy,

$$T(A:B) = H(A) + H(B) - H(AB) \qquad [4.4]$$

as the difference between the receiver's entropy and that part of its
entropy which is noise,

$$T(A:B) = H(B) - H_A(B) \qquad [4.5]$$

as the difference between the sender's entropy and that part of its
entropy lost by equivocation,

$$T(A:B) = H(A) - H_B(A) \qquad [4.6]$$


and in terms of probabilities or frequencies, 


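$$\begin{aligned}
T(A:B) &= \sum_a \sum_b p_{ab} \log_2 \frac{p_{ab}}{\pi_{ab}} \qquad [4.7] \\
&= \sum_a \sum_b p_{ab} \log_2 \frac{p_{ab}}{p_a p_b}
 = \sum_a \sum_b \frac{n_{ab}}{n} \log_2 \frac{n_{ab}\, n}{n_a n_b}
\end{aligned}$$
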


The probabilities $\pi_{ab} = p_a p_b$ are expected when sender and receiver do not
communicate and operate independently of each other. The so-called
maximum entropy probabilities provide the standard against which the
observed probabilities p are evaluated explicitly in 4.7 and implicitly in
4.4 through 4.6. The $\log_2 (p/\pi)$ is also known as the log-likelihood ratio.
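That the four expressions 4.4 through 4.7 agree can be verified numerically. The table below is hypothetical and the function name `transmission_measures` is an illustrative choice; the sketch computes 4.7 from frequencies and skips empty cells.

```python
import math

def entropy(frequencies):
    """H = -sum p log2 p over non-empty cells or categories (3.3)."""
    n = sum(frequencies)
    return -sum((f / n) * math.log2(f / n) for f in frequencies if f > 0)

def transmission_measures(table):
    """Compute T(A:B) by 4.4, 4.5, 4.6, and 4.7 for rows a and columns b."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    column_sums = [sum(column) for column in zip(*table)]
    h_a, h_b = entropy(row_sums), entropy(column_sums)
    h_ab = entropy([cell for row in table for cell in row])
    noise, equivocation = h_ab - h_a, h_ab - h_b
    by_frequencies = sum(
        (cell / n) * math.log2(cell * n / (row_sums[i] * column_sums[j]))
        for i, row in enumerate(table)
        for j, cell in enumerate(row)
        if cell > 0
    )
    return {
        "4.4": h_a + h_b - h_ab,
        "4.5": h_b - noise,
        "4.6": h_a - equivocation,
        "4.7": by_frequencies,
    }

mixed = [[30, 10], [5, 55]]   # hypothetical cooccurrence frequencies
results = transmission_measures(mixed)
```

All four entries of `results` coincide up to floating-point rounding, and for a table whose cells equal the products of its margins they would all be zero.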


The relations among these five measures can be visualized with
Figure 12. It depicts the flow of infor[...] components of a communication
[...] the sender's entropy, [...]mitted, and noise ad[...]

The limits of the amou[...] with 4.4. When H(AB) is [...]; then the smaller of the
two, H(A) or H(B), remains. Hence:

$$0 \le T(A:B) \le T(A:B)_{\max} = \min[H(A), H(B)] \qquad [4.8]$$

leading to the so-called index of predictability: [...]

We will generalize the maximum amount of information in 8.9 and the [...]
[Figure 12: Flow of information through a channel, relating the sender's
entropy H(A), the receiver's entropy H(B), the information transmitted
T(A:B), and the equivocation H_B(A).]
[Figure 13: Cross-tabulation of the first interview A (rows) by the second
interview B (columns). Categories: R+ Republican for Willkie, R- Republican
against Willkie, D+ Democrat for Willkie, D- Democrat against Willkie.]


tion quantity, we apply three alternative computations to the data in 
Figure 3, now summarized in Figure 13. Although these data are not 
literally about sending and receiving messages, they do conform to the 
requirements, particularly to form c. The absence of information trans- 
mitted from the first to the second interview would mean that changes 
occurred at random, whereas non-zero amounts would indicate that
attitudes in the second interview are to an extent indicated by T(A:B) 
predictable from knowledge of the first, the retention of these attitudes 
offering the most obvious explanation. 


Applying 3.3 and 3.5 directly to the table and its margins (using "A" 
to designate the first interview and “B” the second) yields 
H(A) = 1.705 bits 
H(B) = 1.575 bits 
H(AB) = 2.227 bits 

By 4.4, the amount of information transmitted between the two
interviews is
T(A:B) = H(A) + H(B) - H(AB) = 1.054 bits
This quantity is statistically significant beyond reasonable doubt, an
issue addressed in Chapters 10 and 11. By 4.8, T(A:B)max = 1.575 bits,
and the difference between the observed and this maximum quantity is
the result of some voters changing their minds. These cases are found in
the off-diagonal cells of the matrix and constitute noise.
Although this computation of the information quantity is
straightforward and simple, it does not shed light on where changes have
introduced uncertainties and how they affect the measure. To highlight
such analyses, Figure 14 demonstrates a second approach. It shows the
computation of the quantity of noise, using 4.2. We see the conditional
probabilities p_b|a tabulated and the entropies H_a(B) associated with each
row of this matrix. Both respectively exhibit and indicate the results
obtained during the second interview given the outcome of the first. The
noise H_A(B) is the average row entropy H_a(B), each row weighted by
p_a. In the absence of any pattern across time, we would expect the
distribution of conditional probabilities p_b|a to replicate the
unconditional probabilities p_b, in which case all H_a(B) = H(B) and
T(A:B) = 0. In the other extreme, when the second interview merely
replicates the first, all observations would turn up in the diagonal,
each row would have only one occupied cell, all H_a(B) = 0 and
T(A:B) = H(B), which is the maximum information retainable in this case.
In the example all entropies H_a(B) are non-zero and smaller than H(B).
We also note that most changes understandably occur in the two rows for
the initially conflicting categories (R-) and (D+), which are indicated
by an entropy markedly higher than in the rows for the consistent
categories, (R+) and (D-). But because the two conflicting categories
occur less frequently than the other two, their row entropies also
contribute less to the total amount of information in the data (last
column in Figure 14). Subtracting the quantity of noise, H_A(B), from
the entropy in the second interview, H(B), as in 4.5, again yields
T(A:B) = H(B) - H_A(B) = 1.575 - .522 = 1.054 bits of information


            p_b|a (b = R+, R-, D+, D-)      p_a    H_a(B)   p_a H_a(B)
a = R+     .956   .022   .007   .015       .507     .327      .166
    R-     .314   .657   .000   .029       .132    1.069      .141
    D+     .042   .000   .500   .458       .090    1.207      .109
    D-     .014   .014   .028   .944       .271     .393      .106

    p_b    .534   .102   .056   .308       H(B) - H_A(B) = T(A:B)
                                           1.575 - .522  = 1.054
Figure 14
retained between the two interviews. With this simple entropy
difference, the quantity implicitly compares the conditional
probabilities p_b|a with the (unconditional) probabilities p_b row for row.
The third approach to computing information quantities is illustrated
by applying 4.7. It compares the observed probabilities p_ab with
the maximum entropy probabilities π_ab that are expected under
conditions of independence. Both of these and the weighted log-likelihood
ratios are given in Figure 15. In the top and leftmost (R+R+) cell of these
matrices we find the observed probability p_ab = 129/266 = .485 to be
larger than the expected probability π_ab = p_a p_b = .508 × .534 = .271; the
latter would have been observed had the attitudes expressed during the
two interviews been unrelated or changed at random. In this cell
observations exceed expectations by a factor of p_ab / π_ab = 1.790. Because this
likelihood ratio exceeds unity, the log-likelihood ratio is positive. The
weighted log-likelihood ratio contributes p_ab log2(p_ab / π_ab) = .407 bits to the
total amount of information. Figure 15 shows these contributions for all
cells. It turns out that all diagonal cells are positive as well. The signs of
the weighted log-likelihood ratios indicate whether observations are
above (plus) or below (minus) expectations. The sum over all of these
quantities is the amount of information retained between the two
interviews, or T(A:B) = 1.054 bits as before. Again, regardless of how the
amount of information is computed, the resulting quantity expresses the
above comparisons implicitly.


π_ab = p_a p_b

          R+     R-     D+     D-
    R+   .271   .052   .027   .156
    R-   .070   .013   .007   .047
    D+   .048   .009   .005   .028
    D-   .145   .028   .015   .028

1.054 bits = T(A:B)


Figure 15
Redundancy


One of Shannon's most celebrated contributions is the proof that
noise that detracts from the amount of information otherwise
transmittable can be counteracted up to an arbitrarily small error either by
additional correction channels of a capacity equal to or exceeding the
amount of noise entering the communications or by coding an
equivalent amount of redundancy into the channel. Familiar forms of the
latter are repetitions of the messages sent or the use of fewer than all
possible messages, including parity checks of various complexity. This
gives rise to measures of redundancy. For simple entropies within a
communication channel, redundancy is the difference between the
entropy of a uniform distribution and the observed entropy and is an
information measure in its own right:


T(A) = H(A)max - H(A) [4.10]
or expressed as an index:


T(A) / H(A)max = 1 - H(A) / H(A)max [4.11]
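
As a small worked illustration of 4.10 and 4.11, the sketch below assumes a hypothetical distribution over four messages and takes H(A)max to be the entropy of a uniform distribution over the same number of categories, as the text describes. The function name `redundancy` is an illustrative choice, not notation from the text.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def redundancy(probs):
    """Return the redundancy T(A) of 4.10 and the index of 4.11."""
    h_max = math.log2(len(probs))   # entropy of a uniform distribution
    t = h_max - entropy(probs)
    return t, t / h_max

# Hypothetical distribution over four messages (an assumption for illustration).
t, index = redundancy([0.5, 0.25, 0.125, 0.125])
print(t, index)
```

Here H(A) = 1.75 bits against a maximum of 2 bits, so the redundancy is 0.25 bits, or one eighth of the maximum.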


The latter led Shannon and Weaver (1949) to observe that the English 
language is about 50% redundant, a figure that other researchers have 


accounts for the fact that we can 


t n transmitted can be generalized to 
many variables (McGill, 1954; Ashby, 1969): 


T(A:B:C:...:Z) = H(A) + H(B) + H(C) + ... + H(Z) - H(ABC...Z) [4.12]
tion in a chain A-B-C-D-...-Z, similar to Figure 9, turns out to be
the sum of the transmissions in each of its components:


T(A:B:C:...:Z) = T(A:B) + T(B:C) + T(C:D) + ... + T(Y:Z) [4.13]


The model assumes that communication between A and C, A and Z, B 
and Z, and so on and all higher-order interactions are absent. For such 
chains a bottleneck theorem states that the amount of information a 
chain can maximally transmit from its input to its output cannot exceed 
the amount transmitted by its weakest link: 


T(A:Z) ≤ min[T(A:B), T(B:C), T(C:D), ..., T(Y:Z)] [4.14]


Subsequent chapters expand Shannon's original conceptions. 


5. COMPARISONS OF 
QUALITATIVE VARIATES 


Situations may arise in which comparisons of simple measures of
diversity will not suffice. Recall Figure 8, which depicts rather different 
distributions of equal entropy. We are concerned here with comparing 
two frequency or probability distributions within the same variables. 
Consider a hypothetical set of data created after Theil (1972), who 
studied racial segregation in Chicago with entropy measures. 

                      Races b of B

           Black   Caucasian   Hispanic   Asian    Racial Entropy (Bits)
a = 1       422        91        151         0     H_1(B) = 1.294
    2        11       257         37        66     H_2(B) = 1.292
    3        68        68         68        68     H_3(B) = 2.000
    4        50        51         25        14     H_4(B) = 1.837
    5         1        98          1         2     H_5(B) =  .307
Total       552       565        282       150     H(B)   = 1.836


Figure 16

Figure 16 lists the racial composition in all five schools of a district.
The first and largest school is predominantly black. The second is
predominantly Caucasian, with the fifth and smallest being nearly
exclusively so. Although there are apparent differences in the first two schools'
composition, both have nearly the same racial entropy of 1.292 bits, 
which shows that the content-free quantities of entropy by themselves 
say nothing about differences between two frequency distributions and 
hence about racial discrimination or bias in this case. Expectations are 
important. If race does not enter a school's admissions policy, we would 
expect the racial mixture within each school to resemble that in the 
school-age population of the surrounding community as a whole. This is 
clearly not the case in the first, second, and last schools, which have 
unexpectedly large numbers in one category, or in the third school, 


which, despite good intentions, employs an equal quota system. We 
approach such comparisons in two ways. 


Informational Distance 


One way is suggested by comparing one row with the aggregate sum
of all others. In effect this means rearranging the data in Figure 16 into
several 2 × N_B matrices as in Figure 17, in our example, one for each
school. The amount of information transmitted between the two
variables, a versus not-a and B, as obtained by 4.4 through 4.7, measures the
dissimilarity or difference between one row and all others and is called
the informational distance T(aā:B):


T(aā:B) = Σ_b p_ab log2 [p_ab / (p_a p_b)]
        + Σ_b (p_b - p_ab) log2 [(p_b - p_ab) / ((1 - p_a) p_b)] [5.1]


T(aā:B) is zero or positive. It is maximum if, whenever p_ab is zero, p_b - p_ab
is non 


ero and v ceed 1 bit. It is zero when the 
probability distribution Pap in a equals t 


able to multidimensional scaling analyses. 
In our example the us rmational distance T(aa:B) as a 


marred by its otherwise useful 
Symmetry. It responds to the volu 3 
[Figure 17: A 2 × N_B matrix with one row for a (cells p_ab) and one row
for not-a (cells p_b - p_ab), columns b of B, and column margins p_b.]


between the second school and its environment is .185 bits, whereas for 
the last school, which is in fact more selective, it is only .081 bits. 


Informational Bias 


The second measure compares observed and expected probabilities 
as well but only within any one row (or column, as appropriate). The 
informational bias 


T(a:B) = (1 / p_a) Σ_b p_ab log2 [p_ab / (p_a p_b)] [5.2]


consists of only the first part of T(aā:B) in 5.1. Observations in this a-th
row have the status of a subsample, and T(a:B) measures the degree to
which that subsample differs from the whole sample of which it is a part.
In the algebraically equivalent forms,


T(a:B) = Σ_b p_b|a log2 [p_b|a / p_b] = - Σ_b p_b|a log2 p_b - H_a(B)


the measure appears to be the difference between an entropy in B, which
is weighted not by p_b, as in H(B), but by p_b|a, and the (conditional)
entropy H_a(B) in the a-th row. T(a:B) is related to the total amount of
information in a matrix by


T(A:B) = Σ_a Σ_b p_ab log2 [p_ab / (p_a p_b)] = Σ_a p_a T(a:B)


and to T(aā:B) by
T(aā:B) = p_a T(a:B) + p_ā T(ā:B)


Obviously, the informational bias is not symmetrical regarding the two
distributions, T(a:B) ≠ T(ā:B), and therefore cannot be interpreted as a
distance function.
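
The two measures can be sketched in a few lines. The counts below are those of Figure 16 as far as they are legible; cells that are damaged in the source were inferred from the column totals and the row entropies, which is an assumption. The function names `bias` and `distance` are illustrative.

```python
import math

# Counts per school (rows) and racial category (columns) as read from Figure 16.
# Assumption: illegible cells were filled in from the column totals.
counts = [[422, 91, 151, 0],
          [11, 257, 37, 66],
          [68, 68, 68, 68],
          [50, 51, 25, 14],
          [1, 98, 1, 2]]
n = sum(map(sum, counts))
p_b = [sum(col) / n for col in zip(*counts)]

def bias(row_counts, p_b):
    """Informational bias T(a:B) of 5.2: sum over b of p_b|a log2(p_b|a / p_b)."""
    total = sum(row_counts)
    return sum((c / total) * math.log2((c / total) / pb)
               for c, pb in zip(row_counts, p_b) if c > 0)

def distance(a, counts):
    """Informational distance T(a a-bar:B), computed as
    p_a T(a:B) + p_abar T(abar:B), where abar aggregates all other rows."""
    rest = [sum(col) - col[a] for col in zip(*counts)]
    p_a = sum(counts[a]) / n
    return p_a * bias(counts[a], p_b) + (1 - p_a) * bias(rest, p_b)

biases = [bias(row, p_b) for row in counts]
distances = [distance(a, counts) for a in range(len(counts))]
# The weighted sum of the biases is the total transmission T(A:B).
t_ab = sum(sum(row) / n * t for row, t in zip(counts, biases))

for a, (t, d) in enumerate(zip(biases, distances), start=1):
    print(a, round(t, 3), round(d, 3))
print(round(t_ab, 3))
```

Because the distance weights each bias by the size of the subsample, a small but highly selective school can show a large bias and still a small distance, which is the contrast the text draws between the second and the last school.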
            p_b|a (b = Black, Caucasian,     - Σ_b p_b|a log2 p_b  -  H_a(B)  =  T(a:B)
                   Hispanic, Asian)

a = 1     .636   .137   .227   .000               1.701             1.294       .407
    2     .029   .693   .100   .178               1.896             1.292       .604
    3     .250   .250   .250   .250               2.187             2.000       .187
    4     .357   .364   .179   .100               1.837             1.837       .000
    5     .010   .959   .010   .021               1.505              .307      1.198

    p_b   .351   .363   .183   .097        H(B) - H_A(B) = T(A:B)
                                           1.836 - 1.405 = .431


Figure 18


For the racial segregation data, Figure 18 lists the conditional
probabilities p_b|a, the probabilities p_b to which the former are compared, and
the entropy components leading to T(a:B) to the right of this matrix.
Informational bias measures reveal, what is intuitively rather obvious,
that the last school follows the most stringent discriminatory policy,
making, on average, more than one decision per student to keep certain
racial groups out. All of these measures except for the fourth school are
significant beyond reasonable doubt.

Bias measures of this kind have many applications and will be
generalized in Chapter 13 for examining strata in complex models.


6. STRUCTURAL MODELS 
[Figure 19: The logic of structural modeling. The reality of n
observations, as seen by an observer in variables A, B, ..., Z, yields
the observed data in AB...Z (reality m0, with H(m0) = H(AB...Z)). The
parameters K1, K2, ... of a model m, a model of maximum entropy within
parameters, yield model-generated data in AB...Z with
H(m) = H(K1:K2:...). The two are compared by a test for goodness of
fit.]


sent here the logic of such models, leaving quantitative accounts for 
Chapters 8 and 12. 


Parameters 


The parameters of a structural model are relations within selected
subsets of the variables modeled. For example, Shannon's model of
communication consists of a sequence of bivariate components, each
realizing a relation in pairs of variables, ultimately linking an input to an
output through a chain of components, excluding all bypasses and
higher-order interactions. The choice of parameters may have technical,
theoretical, empirical, or even aesthetic motivations. A technical reason
for adopting certain parameters might stem from knowledge of the
system modeled. If two variables are not connected in reality, an
appropriate model need not consider this relation. A theoretical reason might
rely on the dependencies that an existing theory anticipates. An
empirical reason might point to evidence that the parameters chosen are those
minimally necessary to reproduce the given data. An aesthetic reason
might be based on preferences for certain kinds of explanations, simple
ones, for instance.

When data are quantitative, parameters may be defined by
mathematical functions. For example, the equation y = β_y(x) + e_y describes a
linear relation between the two variables X and Y. Because data rarely
exhibit such a one-to-one relation (function), as the coefficient β_y
implies, an error term e_y is added to express the deviation in Y from the
ideal line and yields a closer approximation to what the data actually
show. Figure 20 depicts three quantitative relations illustrating this
point.

[Figure 20: Three plots of observed xy pairs, with axes X and Y.]

When data are qualitative, functional expressions such as those
underlying Figure 20 are inappropriate because addition and
multiplication do not apply to unordered variables. What would be appropriate
here is the use of the very distribution of cooccurrences in the original
data (for example, data in Figures 5, 10, and 13, the distribution of
observed pairs in Figure 20), all of which constitute the most obvious and
uncontaminated manifestation of relations among variables. Observed
cooccurrences may not have the aesthetic appeal that mathematical
equations have, but because all equations can be represented
distributionally, as Figure 20 shows, the form is not only the more universal of
the two but also avoids several e 


y of them assuming linearity, as 
he very distribution of observa- 


ould have been estimated avoids 
plicity. 


contrasts, differences, causes) at 


interactions. Even Sh 
limited in this way. A: 
models, parameters are represented by boxes to which some variables
are attached by lines. We say "attached" because these variables may be
inputs or outputs, as in Shannon's communication chain, or simply
observed variables without causal implications. The number of
variables involved in a parameter equals that component's ordinality. The
second model in Figure 21 contains one fourth-order component,
ABCE, and two third-order components, BCD and CDE. Researchers
familiar with path diagrams and causal networks, in which nodes
represent variables and lines represent influences, must make a gestalt switch
here, converting lines into boxes and nodes into lines. Block diagrams
are capable of representing higher-order interactions (in boxes).
Graphical devices that represent influences by arrows or dependencies by lines
between variables cannot capture anything above an ordinality of two.

We can think of a component of an ordinality of one as a simple
random generator that reproduces the distribution of frequencies or
probabilities as originally observed in the one variable attached to it. By
extension, a component of higher ordinality can be thought of as a
random generator that reproduces the distribution of frequencies or
probabilities of cooccurrences through which the observed relation
between the attached variables is empirically manifest.


Composition 


A structural model consists of several components, each specified by
a different parameter with respect to which it corresponds to the data to
be modeled, and none is included in or equivalent to another. I consider
here the logic for composing such models (see Klir, 1976, 1981) and
define appropriate terms.


A model is said to cover the variables it models. Figure 21 shows four
different models covering the same set of five variables. The first does
not differentiate components. It represents the data in their original
form by a single, all-encompassing component of ordinality five without
simplification and is called the saturated model. I refer to this extreme
case of a model by m0. Here the saturated model is m0 = ABCDE. The
second consists of three components, K1 = ABCE, K2 = BCD, and K3 =
CDE, and is denoted by m1 = ABCE:BCD:CDE. The third is derived
from m1 by omitting the CDE component, which does not change its
cover. The fourth consists of five components, one for each variable;
and because they work entirely independently of each other, we denote
this condition by the subscript "ind." The model of independent
variables is m_ind = A:B:C:D:E.

[Figure 21: Block diagrams of four models covering the variables A, B,
C, D, and E: m0 = ABCDE, m1 = ABCE:BCD:CDE, m2 = ABCE:BCD, and
m_ind = A:B:C:D:E.]

[Figure 22: Three graphical variants of the model m2 = ABCE:BCD.]

A model's components must be neither included nor equivalent
relative to one another and are connected by the variables they share. One
component is included in another if all of the former's variables are also
variables of the latter. Two com: 


same variables. The restriction is moti 
tion of a relation already explain 
Figure 22 shows three mere gra 
ABCE:BCD of Figure 21 in wh 
crossed and should be omitted. 

In the model ABCE:BCD:CDE of F 
unique to the first component Κι = 
component of that model. BC is sh 
component, and we note this fact 


Ki&K3= CE and K;&K; = CD. Ki&K2&K; = 


phical variates of the model m; = 
nent [...], particularly all relations of an ordinality less than the
component's. Thus the parameter ABCD contains four ternary relations
(ABC, ABD, ACD, and BCD), six binary relations (AB, AC, AD, BC, BD,
and CD), four unary relations in separate variables (A, B, C, and D),
and the nominal φ. The parameter ABC contains AB, AC, BC, A, B, C,
and φ, all of which are already contained in ABCD. This embeddedness
motivates the demand that the components of a model be neither
included nor equivalent relative to each other and leads to calling such
models hierarchical.

Thus there arises the need to consider what is unique to a set of
variables, for example, what is unique to ABC and not reducible to AB,
AC, BC, or to what any of these contain. This is called an interaction. An
interaction is a unique dependency from which all relations of a lower
ordinality are removed. All of its variables are essential; none can be
omitted. By this definition interactions are not embedded in each other,
turn out to be the additive content of a relation, and form Boolean
lattices. We use < and > to distinguish interactions from the
parameters that contain them. Figure 23 depicts the interactional content of
the relation ABCD.

[Figure 23: Lattice of the interactions contained in the relation ABCD,
with <ABCD> at the top.]

Except for the saturated model, m0, which contains all interactions
in the original data, all other models exclude some interactions.
Figure 23 also depicts the interactions contained in the model
ABC:[...]D, using heavy lines to connect them. Interactions excluded
from this model are connected by fine lines, and interactions shared by
two components are connected by two heavy lines. As indicated in
this figure, in lattices of all possible interactions, relations form
sublattices, one for each component and one for each shared set of variables.
Relations Between Models: Descendency


Two models are said to have the same structure or are of the same
structure type if one can be obtained from the other by a mere one-to-one
relabeling of its variables. For example, the models ABCE:BCD:CDE
(see Figure 21) and ABCE:ABD:ADE have the same structure because
the latter can be obtained from the former by exchanging A with C in all
components. Models of the same structure yield the same block
diagrams except for the labels on their connecting lines. Block diagrams
without labels depict structure types (see Figure 25).
Structural modeling often requires comparisons of models. For this
purpose we define the notion of descendency. One model is said to be a
descendent of another if all relations in the former model are included in
the latter and both models cover the same variables. Descendent models
are also called nested models. For example, the model AB:AD:BCDE is
a descendent of ABD:BCDE because AB and AD are included in ABD
and BCDE occurs in both. Two models of which neither is a descendent
of the other are incompatible. For example, the model AC:BCDE, which
contains AC, is incompatible with ABD:BCDE and AB:AD:BCDE
because AC is absent from both. We denote descendency by an arrow from
an ancestor to its descendent: for example, ABD:BCDE → AB:AD:BCDE,
or more generally m_i → m_j.
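
The definition of descendency translates directly into set operations. The following sketch assumes that every variable is named by a single letter and that a model is written as colon-separated components, as in the text; the helper names `parse`, `covers`, `is_descendent`, and `incompatible` are hypothetical.

```python
def parse(model):
    """Turn a string such as 'AB:AD:BCDE' into a set of components,
    each component being a frozenset of single-letter variables."""
    return {frozenset(part) for part in model.split(":")}

def covers(model):
    """The set of all variables a model covers."""
    return frozenset().union(*model)

def is_descendent(candidate, ancestor):
    """True if every component of `candidate` is included in some component
    of `ancestor` and both models cover the same variables.
    Under this test a model also counts as a descendent of itself."""
    return covers(candidate) == covers(ancestor) and all(
        any(k <= big for big in ancestor) for k in candidate)

def incompatible(m1, m2):
    """Neither model is a descendent of the other."""
    return not is_descendent(m1, m2) and not is_descendent(m2, m1)

print(is_descendent(parse("AB:AD:BCDE"), parse("ABD:BCDE")))
print(incompatible(parse("AC:BCDE"), parse("ABD:BCDE")))
```

Checking inclusion of components is sufficient here because every relation embedded in a component is a subset of it.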


A model is a direct or imme 


ABD:BCDE → AB:AD:BCDE → AB:BCDE → A:BCDE → A:BCD:BCE:BDE:CDE


Here interactions <ABD>, <AD>, <AB>, and <BCDE> are removed
in this order, reaching A:BCD:BCE:BDE:CDE in four steps. Klir
(1976) calls these Steps “ as they introduce sim- 
pler components. An al a model's immediate 
descendents is givenin 
[Figure 24: Lattice relating m0 = ABCDE, the nearest common ancestor
m_i ∪ m_j = AC:ABD:BCDE, the models m_i = AC:BCDE and
m_j = ABD:CD:CE, the nearest common descendent
m_i ∩ m_j = A:BD:CD:CE, and m_ind = A:B:C:D:E, with the number of
generations separating them.]


only the components of the models being compared, except those
included or equivalent relative to each
other. For example, when putting the two incompatible models
AC:BCDE and ABD:CD:CE together, CD and CE of the second model
are contained in BCDE of the first and are redundant in any common
ancestor. The remaining components, AC:ABD:BCDE, constitute the
nearest common ancestor of the two models. The nearest common
descendent contains all and only the interactions shared by the models
being compared. Using the same two models as an example, ABD:CD:CE
contains <ABD>, <AB>, <AD>, <BD>, <CD>, and <CE>, of
which only the last three are shared with AC:BCDE. Given that none of
the interactions involving A is shared, the nearest common descendent
will include A as a separate variable and BD, CD, and CE as
components. The resulting A:BD:CD:CE retains all shared interactions,
preserves the distinctions made by either model, and is their nearest common
descendent.
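
The construction just described can be sketched with sets. The code assumes single-letter variable names and treats everything embedded in a component (all of its non-empty subsets) as shared material when computing the descendent; the function names are hypothetical and the approach is an illustration, not the algorithm of Chapter 14.

```python
from itertools import combinations

def parse(model):
    """'AC:BCDE' -> set of components (frozensets of single-letter variables)."""
    return {frozenset(part) for part in model.split(":")}

def maximal(sets):
    """Drop every set that is properly included in another one."""
    sets = set(sets)
    return {s for s in sets if not any(s < other for other in sets)}

def relations(model):
    """All non-empty subsets of a model's components."""
    return {frozenset(c) for k in model
            for r in range(1, len(k) + 1)
            for c in combinations(sorted(k), r)}

def nearest_common_ancestor(m1, m2):
    """All components of both models, minus those included in another."""
    return maximal(m1 | m2)

def nearest_common_descendent(m1, m2):
    """The largest relations that both models contain."""
    return maximal(relations(m1) & relations(m2))

def show(model):
    """Write a model back in colon notation, components in sorted order."""
    return ":".join(sorted("".join(sorted(k)) for k in model))

m_i, m_j = parse("AC:BCDE"), parse("ABD:CD:CE")
print(show(nearest_common_ancestor(m_i, m_j)))
print(show(nearest_common_descendent(m_i, m_j)))
```

For the two models of the example, the sketch reproduces the ancestor AC:ABD:BCDE (printed with its components in sorted order) and the descendent A:BD:CD:CE.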

As already mentioned, generic refere 
mj referring to a model in general, or Ki 
T components. The nearest common ance 
by "m_i ∪ m_j" and the nearest common descendent by "m_i ∩ m_j". The saturated
model, m0, representing the data without simplification, can be called the
most distant common ancestor, and m_ind, the least complex of all possible
models, can be called the most distant common descendent. It follows


that for any two models, covering the same variables


ancestor is composed of all and 
m0 → m_i ∪ m_j → m_i → m_i ∩ m_j → m_ind [6.1]

m0 → m_i ∪ m_j → m_j → m_i ∩ m_j → m_ind


These relationships define a lattice, which is depicted in Figure 24 using 
the above models as an example. The numbers indicate how many 
generations these models are apart. 

Lattices of structural models as in Figure 24 differ from the Boolean 
lattices of all possible interactions as in Figure 23. Lattices of all possible 
models provide important guides for analysts of qualitative data to find 
their way through the forest of models to be considered. They also form 
[...] computer programs for exploratory analysis. To appreciate
the [...] lattices we have to be concerned with, Figure 25 depicts two
versions of the lattice of all models in three variables and two lattices of
all structure types of models in three and four variables, respectively.
Chapter 14 considers algorithms for the generation of such lattices.

[Figure 26: Block diagrams of the models m1 through m4 examined in
Chapter 7.]


7. MODELS WITH AND 
WITHOUT LOOPS 

Shannon's communication chain is the prototype of a structural 
model without loops. The output of one component is the input to the 
next. There is no feedback. Causality goes one way only. No component 
influences itself, directly or indirectly. Models without loops can be 
evaluated sequentially, and convenient algebraic (so-called closed form) 
expressions for computing the maximum entropy probabilities are 
made available in Chapters 12 and 14. For structural models with loops, 
algebraic expressions are unavailable and maximum entropy probabil- 
ities must be computed iteratively (see Chapter 12), requiring electronic 
computers. This difference motivates the distinction elaborated in this 
chapter. 

When models are simple, loops are easily recognizable by their
circularity. But when models cover many variables and dependencies among
them are complex, a visual inspection of block diagrams may be
misleading. Consider the examples in Figure 26. Here m1 clearly contains a
loop involving A-B-CD and back to A, but the structure in the others


might not be so transparent. 

To detect whether a structural model contains loops, we use an
algorithm similar to the one suggested by Bishop, Fienberg, and Holland
(1978:76):


Given the components K1, K2, ..., Ka, Kb, ... of a model
(1) remove all variables that are unique to any Ka
(2) remove any Ka that is equal to or contained in any other Kb of the
(remaining) set.
Repeat 1 and 2 until either
(a) no variables remain, in which case loops are absent, or
(b) the remainder is unalterable by 1 or 2, in which case loops
exist.
Take m2 of Figure 26 for the first example:
Given: ABC:ACD:BCE
by 1:  ABC:AC:BC
by 2:  ABC
by 1:  φ
Hence m2 does not contain loops.
Applied to m3 of Figure 26:


Given: AB:ACD:BCE:DE
by 1:  AB:ACD:BCE:DE (no unique variable)
by 2:  AB:ACD:BCE:DE (no K equal to or contained in another)


Hence m3 does contain loops, namely A-B-E-D-A, A-B-C-A,
and C-D-E-C.
Applied to m4 of Figure 26:


Given: ABC:ABE:BCD
by 1:  ABC:AB:BC
by 2:  ABC
by 1:  φ

Hence m4 does not contain loops.


Actually, m2 and m4 have the same structure (one can be obtained from
the other by exchanging labels), and the test is redundant. Their
different appearances demonstrate that block diagrams can hide the
existence of loops or falsely suggest them. The formal test is conclusive,
however.
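
The two reduction steps are easy to mechanize. The following is a minimal sketch of the test as stated above, assuming single-letter variable names and colon-separated components; the function name `has_loops` is illustrative, and the code is not taken from Bishop, Fienberg, and Holland.

```python
def has_loops(model):
    """Repeat steps 1 and 2 until nothing changes.
    Returns False if no variables remain, True if a remainder resists reduction."""
    components = [set(part) for part in model.split(":")]
    while True:
        before = [set(k) for k in components]
        # Step 1: remove variables that occur in exactly one component.
        for k in components:
            others = set().union(*(o for o in components if o is not k))
            k -= {v for v in k if v not in others}
        components = [k for k in components if k]
        # Step 2: remove components equal to or contained in another one
        # (of several equal components, the first is kept).
        kept = []
        for i, k in enumerate(components):
            dominated = any((k < o) or (k == o and j < i)
                            for j, o in enumerate(components) if j != i)
            if not dominated:
                kept.append(k)
        components = kept
        if not components:
            return False      # (a) no variables remain: loops are absent
        if components == before:
            return True       # (b) remainder unalterable: loops exist

for m in ("ABC:ACD:BCE", "AB:ACD:BCE:DE", "ABC:ABE:BCD"):
    print(m, has_loops(m))
```

Run on the three examples above, the sketch reports loops only for AB:ACD:BCE:DE, in agreement with the hand computation.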
The test is merely intended to separate structures according
to their computational requirements. However, the sequence
of steps leading to a determination that loops are absent is nothing but the
reverse of a procedure by which such models could be constructed, by
extending components to cover additional variables or by
adding components with new variables to an existing model. The
probabilities generated by loopless models are computable in the same order.
The remainders supporting the conclusion that loops exist point to a set of
variables in which some loop(s) make sequential computations
impossible, requiring iterative processes instead.


8. INFORMATION IN MODELS 
AND IN DATA 


We stated that structural models that reproduce given data reasonably
well may serve as an explanation of those data. To assess their goodness
of fit, measures are now needed that compare the artificial data
generated by a model with the original data. In this chapter we will extend
Shannon's initially bivariate notion of information and develop the
instrumentarium needed to quantify how much of the information
in data is represented in a particular model or ignored by it.
These information quantities not only provide criteria for deciding how
good a model is but will also guide the exploration of alternative model
structures.

To begin with McGill's (1954) and Ashby's (1969) generalization of
Shannon and Weaver's (1949) notion of information, we observe that
4.12 is in fact composed of two separate entropies, the entropy in model
m_ind = A:B:C:...:Z, whose variables are all independent, and the entropy
in the saturated model m0 = ABC...Z, both covering the same variables:


T(A:B:C:...:Z) = H(A) + H(B) + H(C) + ... + H(Z) - H(ABC...Z) [8.1]

               = H(A:B:C:...:Z) - H(ABC...Z)

               = H(m_ind) - H(m0)

               = T(m_ind)


H(m0) conforms to 3.5 and is the entropy in the original data that
contain all complexities a model might hope to explain within the
variables covered. It is also the smallest entropy a model of these data
can generate. H(m_ind) is the entropy in a model of independent variables
that excludes all relations among variables the data could contain. Being
computed as the sum of the entropies in the individual variables, by 4.1 it
is also the maximum entropy obtainable within the given set of variables
and the entropy of the so-called maximum likelihood distribution to
which only the knowledge of the distribution in individual variables
enters (Gokhale and Kullback, 1978). The difference between the two,
the total amount of information, T(m_ind), is the maximum amount of
information in the data that a model can conceivably explain. T(A:B) in
4.4 through 4.7 is its simplest form.

In developing the required information quantities, we proceed in two
steps. The first is to generalize T(m_ind) in 8.1 to more complex models.

Let p_abc...z be the probabilities in the data (or "generated" by m0) and let
π_abc...z be the probabilities generated by a model m_j. Then, by analogy
to 4.7,


T(m_j) = Σ_a Σ_b ... Σ_z p_abc...z log2 [p_abc...z / π_abc...z] [8.2]


T(m0) = 0 and for models m_j with loops:


T(m) # H(m) - H(m,) [8.3] 


Given that T(m_ind) is the amount of information m0 can and m_ind cannot
explain, T(m_j) must be interpreted as the amount of information in the
data m0 that escapes an account by model m_j. It measures by how much
m_j is in error and indicates the quantity that structural modeling efforts
aim to minimize.


The second step toward the desired generalization concerns information
measures that permit comparisons between descendent models. Let
there be two such models, m_i and m_j, of which m_i generates the
distribution of probabilities ω_abc...z and m_j the distribution
π_abc...z. The informational difference between them, the amount of
information modeled in m_i but not in m_j, is

$$I(m_i \to m_j) = \sum_a \sum_b \cdots \sum_z p_{abc \ldots z} \log_2 \frac{\omega_{abc \ldots z}}{\pi_{abc \ldots z}} \qquad [8.4]$$

This most general definition of information lends itself to what is
probably the most important identity for partitioning quantities of
information. It is attributed to Gokhale and Kullback (1978):

I(m_0→m_ind) = I(m_0→m_j) + I(m_j→m_ind) [8.5]

This identity decomposes the total amount of information in the data,
I(m_0→m_ind), into one quantity, I(m_j→m_ind), for which the model m_j
accounts, and another quantity, I(m_0→m_j), by which that model is in
error, the latter being due to the differences between the observed and
the model-generated data.

Along a line of descendent models, 8.5 is extendable to any number of
models whose informational differences are related as follows:

I(m_0→m_ind) = I(m_0→m_i) + I(m_i→m_j) + I(m_j→m_ind). [8.6]

This equation enables the analyst to assess not only how much
information a given model ignores and represents in its structure,
respectively, but also by how much another model would improve its
fit. We shall use both forms, 8.5 primarily for confirmation and 8.6
primarily for exploration. For algebraic convenience, the informational
difference in 8.4 may also be stated in T-terms:

I(m_i→m_j) = I(m_0→m_j) − I(m_0→m_i) = T(m_j) − T(m_i) [8.7]

For the chain m_chain = AB:BC:...:YZ, for example,

I(m_0→m_ind) = I(m_0→m_chain) + I(m_chain→m_ind) [8.8]
             = T(AB:BC:...:YZ) + T(A:B) + T(B:C) + T(C:D) + ... + T(Y:Z)
Whereas the sum T(A:B) + T(B:C) + ... + T(Y:Z) expresses the information
transmitted within the chain, the quantity T(AB:BC:...:YZ) summarizes
all quantities of information transmitted between nonsuccessive pairs of
variables (for example, T(A:C), T(A:Z), T(B:Z)) and contained in
higher-order interactions which the chain cannot represent. If the reality
the data represents is indeed chainlike, then the latter and, for this
chain, extraneous quantities will be zero and the total amount of
information equals the sum of the transmissions within components as in
4.13. If these extraneous quantities are non-zero, then these error
quantities suggest that the reality of the data is not quite as chainlike
as the model supposes. Equation 8.8 also demonstrates that the
informational differences between models, expressed in I-measures,
implicitly represent quantities of information transmitted between
variables, expressed in T-measures, whereby the latter need not have the
same covers.

[Figure 27: informational differences along the line of descendent models m_0 = ABCDEF, m_i, m_j, and m_ind, with T(m_0) = 0, I(m_0→m_i) = .4011 bits, and I(m_i→m_j) = 1.3843 bits]


For a numerical example, consider 18 hypothetical observations of
six binary variables A through F, of two values, 0 or 1, each, as listed in
Figure 28. The total amount of information in these data is 1.8301 bits.
Of this, the model ABCD:CDEF accounts for 1.4290 bits, or 78%, and
fails to account for the remaining .4011 bits. These quantities are found
in Figure 27 as well. The model AC:BC:CD:DE:DF accounts for only
.0446 bits, or 2% of the total amount, and fails to account for 1.7854 bits.

The simpler model is totally inadequate. It ignores the information
accounted for by higher-order interactions.

[Figure 28: the 18 hypothetical observations of the six binary variables A through F]
We shall later require an expression of the maximum value of the
quantity I(m_i→m_j). Without justification, this maximum is


7m m)* Σπ - max [R(k&L)] [8.9] 


Επι πι) 
Kem; j 


max 


where K is a component of m_i, L is a component of m_j, and K&L consists
of the variables shared by the two components K and L.

Note that the information quantities so far considered pertain to
models as a whole. They differentiate neither the contributions made by
a model's parts (the expression in T-terms for the chain is an exception)
nor the contributions made by particular interactions. We shall
take up both of these issues in Chapter 14. Note also that we have not
yet shown how the probabilities generated by these models are computed
from available data; Chapter 12 concerns this issue.
Finally, the problem of testing the significance of the information
quantities will be addressed in Chapters 10 and 11.


9. STRUCTURAL ZEROS 


In multivariate spaces some cells may remain empty or have zero
frequencies. Three reasons could account for such cells, the third
being our primary concern. The first is empirical. Zero frequencies
may be due to existing constraints in the data source, and evidence of
this nature may contribute to the research findings. Second, zero
frequencies may be due to sample sizes that are too small to contain all
possible observations. Sampling theory contends that with increasing
sample sizes each possible observation will eventually occur at least
once, no matter how rare the case may be. Significance tests aim at
differentiating between the two kinds of reasons. The third reason is
logical or theoretical. Frequencies may be zero because observations are
impossible or excluded on a priori grounds. Such cells are different from
the other two in that expected probabilities are zero as well. A zero
frequency therefore may have quite different interpretations.

A zero in a cell for empirically possible observations (for which
expected probabilities are non-zero) is called an observational zero,
whereas a zero in an unoccupiable cell or in a cell for which observations
are impossible (and expected probabilities are zero as well) is called a
structural zero.

For example, consider data on "who follows whom" in a study of turn
taking during a group discussion. Given that no person can take the turn
to speak away from himself or herself, all diagonal cells of the square
matrix in Figure 29 will contain structural zeros. A model generating
expected probabilities with which observed frequencies are to be
compared must not enter anything in these unoccupiable cells either. Or
consider the cross-tabulation of messages exchanged between individuals
of higher and lower ranks within a corporation that recognizes five
levels of employment. Given that "higher rank" and "lower rank" are
defined in terms of the difference in rank, only a triangle of the matrix in
Figure 29 is occupiable. The remaining cells contain structural zeros.

[Figure 29: two matrices with structural zeros: "who follows whom" (Who by To Whom), with an unoccupiable diagonal, and messages between ranks (Lower Rank by Higher Rank, levels 1 through 5), with only a triangle occupiable]
Structural zeros do not need to be distributed as regularly as in Figure
29. Further examples are found in Figure 32.

Structural zeros destroy the strict Cartesian orthogonality of complete
multivariate spaces that are formed by the simple product of their
variables, and complicate computations by structural models
(specifically the generation of maximum entropy probability distributions
and the evaluation of their degrees of freedom). In contrast, observational
zeros do not contribute to the information measures and require no
special considerations.


10. DEGREES OF FREEDOM 


Observed frequencies must add up to the sample size. Accordingly, and
with N categories or cells, we can estimate only N − 1 probabilities,
after which the probability in the Nth cell is no longer a matter of
choice. Thus within a variable or space K, the degree of freedom (df) for
simple entropies H(K) is

df_K = N_K − 1 [10.1]


Structural models impose the additional constraint that the computed
probabilities must conform to that model's parameters. In the 3 × 5
matrix AB of Figure 30, for example, there are df_AB = N_AB − 1 = 15 − 1 = 14
degrees of freedom. The parameters of the model A:B with its two
components (variables) A and B have df_A = N_A − 1 = 3 − 1 = 2
and df_B = N_B − 1 = 5 − 1 = 4 degrees of freedom, respectively, and
df_A:B = df_A + df_B = 6.

For models without structural zeros, 10.2 generalizes this notion to

$$df_m = \sum_e df_{K_e} - \sum_e \sum_{f>e} df_{K_e \& K_f} + \sum_e \sum_{f>e} \sum_{g>f} df_{K_e \& K_f \& K_g} - (\text{degrees of freedom in sets of variables shared among four components}) + \text{etc.} \qquad [10.2]$$

For testing the significance of information quantities I(m_i→m_j)
involving two descendent models without structural zeros, the degree of
freedom is

$$df_{m_i \to m_j} = df_{m_i} - df_{m_j} \qquad [10.3]$$

In Figure 30, df_AB→A:B = df_AB − df_A:B = 14 − 6 = 8 degrees of freedom. The
matrix labeled AB→A:B illustrates this condition. Because each model
must satisfy its parameters, this maximum entropy distribution must
meet the requirement that rows and columns add to the marginal
probabilities in A and in B. Of the 3 × 5 = 15 cells, the seven shaded cells
are implied or fixed once the eight unshaded cells are chosen, hence
df_AB→A:B = 8 as obtained above.

[Figure 30: the 3 × 5 matrix AB and the matrix labeled AB→A:B, with seven shaded (implied) and eight unshaded (free) cells]

For the five structure types possible within three variables (see
Figure 25), the degrees of freedom are listed in Figure 31. The entries in
this table take advantage of the fact that in the absence of structural
zeros, multivariate spaces are Cartesian products of their variables, that
is, N_ABC... = N_A N_B N_C .... Because each model in this table is an
immediate descendent of the one above, the last column lists the
degrees of freedom for the interactions eliminated during this descent.
The individual degrees of freedom sum to the total df_m_0→m_ind. The table
should be familiar to χ² users.


i   m_i        df_m_i                                    Interaction   df_m_i→m_i+1
                                                         Removed

0   ABC        N_A N_B N_C − 1                           <ABC>         (N_A − 1)(N_B − 1)(N_C − 1)
1   AB:AC:BC   N_AB + N_AC + N_BC − N_A − N_B − N_C      <BC>          (N_B − 1)(N_C − 1)
2   AB:AC      N_AB + N_AC − N_A − 1                     <AC>          (N_A − 1)(N_C − 1)
3   AB:C       N_AB + N_C − 2                            <AB>          (N_A − 1)(N_B − 1)
4   A:B:C      N_A + N_B + N_C − 3

Figure 31
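The bookkeeping in 10.1 through 10.3 can be checked mechanically. The following Python sketch assumes models without structural zeros, written as lists of components (strings of variable names); the function name `df_model` and the category counts are illustrative assumptions, not part of the text.

```python
from itertools import combinations
from math import prod

def df_model(components, sizes):
    """Degrees of freedom of a model without structural zeros (cf. 10.1, 10.2)."""
    total = 0
    for r in range(1, len(components) + 1):
        sign = 1 if r % 2 == 1 else -1          # alternate +, -, +, ...
        for combo in combinations(components, r):
            shared = set(combo[0]).intersection(*combo[1:])
            if shared:                          # no shared variables, no contribution
                n_cells = prod(sizes[v] for v in shared)
                total += sign * (n_cells - 1)   # 10.1 applied to the shared space
    return total

# The 3 x 5 example: df_AB = 14, df_A:B = 6, and their difference (10.3) is 8.
sizes = {"A": 3, "B": 5}
df_ab = df_model(["AB"], sizes)
df_a_b = df_model(["A", "B"], sizes)
df_difference = df_ab - df_a_b

# A three-variable check against Figure 31 with hypothetical sizes 2, 3, 4.
sizes3 = {"A": 2, "B": 3, "C": 4}
df_abc = df_model(["ABC"], sizes3)
df_loop = df_model(["AB", "AC", "BC"], sizes3)
```

With these sizes, the difference between ABC and AB:AC:BC is (2 − 1)(3 − 1)(4 − 1) = 6, the entry for the removed interaction <ABC>.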


For models with structural zeros, degrees of freedom cannot be
obtained by unqualified multiplication. Cells with structural zeros or
with unalterably fixed probabilities must be discounted not only
numerically but also regarding their positions relative to each other. I
propose the following three-step procedure:

(1) (a) Start with the original space ABC...Z, covering the same

variables as the model m = Ki:K»:... 

(b) Assign zeros to all cells of ABC...Z with structural zeros and 
with a priori and fixed probabilities (neither of which is 
estimated) and assign ones to all other cells. 

(c) Obtain cell entries for each of m's parameters K by summing 
over the corresponding cell entries (zero or one) in ABC...Z. 

(d) Change to zero the entries in those cells of ABC...Z that 
participate in yielding the sum of unity in any cell of a 
parameter K. 

(e) Repeat c and d until the distribution of zeros and ones remains 
unchanged. 

(f) Determine whether the cells in ABC...Z with ones are sepa- 
rable into parts R of ABC...Z. Two submatrices or subspaces 
of a multivariate space are separable if they have no cate- 
gories or qualities in common. 

(g) For each part R separately, sum its entries to cells in each 
component K. Call the set of cells with non-zero entries 
in K: Kr. 

(2) Compute the degrees of freedom according to 10.1, 10.2, and 10.3
but separately for each part R and its corresponding K_Rs. The
degree of freedom df_m_0→m_j is the sum of the degrees of freedom
obtained for each of its parts.

(3) Given the above results for any pair of models, of which one must
be a descendent of the other, compute the difference in the
degrees of freedom by

df_m_i→m_j = df_m_0→m_j − df_m_0→m_i [10.4]
In essence, step 1 removes so-called noninteractive cells, whose
probabilities are not free to be estimated, and it distinguishes among
submatrices or subspaces whose degrees of freedom must be considered
separately. Step 2 computes the degree of freedom df_m_0→m_j used in step 3
to obtain df_m_i→m_j. Equation 10.4 is the general form of 10.3. We
illustrate the procedure with examples from Figure 32.

[Figure 32: matrices with structural zeros used in the following examples]

Suppose the null hypothesis A:B of the independence of A and B is to
be tested with structural zeros distributed as in the left of the three
matrices in Figure 32. Step 1b assigns zeros to the nine cells with
structural zeros and ones to the seven occupiable cells. Summing these
entries toward the margins, step 1c yields ones in cell 4 of A and in cell 1
of B. Step 1d then changes to zero the cells 11 and 44 of AB which are
responsible for the ones in A and in B. Step 1c then finds ones in 1 of A
and 4 of B, causing step 1d to change 12 and 34 of AB to zero, and so on
until, as it turns out in this case, all cells are zero. As there obviously is a
sequence for computing all probabilities in AB from those in A and B,
which this procedure actually traces, there are no options, no degrees of
freedom, and there is in fact no information in the data that the model
does not already contain in its parameters. The set of values to be
estimated being empty, the degrees of freedom is zero.
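Steps 1b through 1e can be traced with a short Python sketch for a two-variable space and the model A:B. The staircase layout below is an assumption: it is one 4 × 4 arrangement of seven occupiable cells consistent with the removals described above (cells 11 and 44 first, then 12 and 34), not a copy of Figure 32. The function name is likewise invented for the illustration.

```python
def remove_noninteractive_cells(occupiable):
    """Steps 1b to 1e for a two-variable space and the model A:B."""
    cells = [row[:] for row in occupiable]      # step 1b: ones and zeros
    n_rows, n_cols = len(cells), len(cells[0])
    while True:
        # Step 1c: sum the entries toward the margins A (rows) and B (columns).
        row_sums = [sum(row) for row in cells]
        col_sums = [sum(cells[i][j] for i in range(n_rows)) for j in range(n_cols)]
        # Step 1d: find the cells responsible for a marginal sum of one.
        doomed = [(i, j) for i in range(n_rows) for j in range(n_cols)
                  if cells[i][j] == 1 and (row_sums[i] == 1 or col_sums[j] == 1)]
        if not doomed:                          # step 1e: nothing changes anymore
            return cells
        for i, j in doomed:
            cells[i][j] = 0

# Assumed layout: rows are the categories of A, columns those of B.
staircase = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
remaining = remove_noninteractive_cells(staircase)
ones_left = sum(map(sum, remaining))
```

For this layout no ones are left, so no probabilities remain to be estimated. A matrix without structural zeros passes through unchanged, because every marginal sum exceeds one.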

Suppose the null hypothesis C:D is to be tested with structural zeros
as distributed in the second matrix of Figure 32. Here step 1c finds a one
in cell 1 of D, causing step 1d to change cell 11 of CD to zero, then step
1c finds a one in cell 1 of C, causing step 1d to change cell 12 of CD to
zero. At this point the iteration stops. The remaining six ones remain in
place. Because the matrix is not separable, we compute the degrees of
freedom from this remaining set as a whole. According to step 2, we find
df_CD = 6 − 1 = 5, df_C = df_D = 3 − 1 = 2, df_CD→C:D = 5 − 2 − 2 = 1, and, indeed,
only one of the six probabilities in this matrix needs to be estimated for
all the other probabilities to become known.

When the third of these matrices is subjected to the same test, the
configuration of zeros and ones assigned by step 1b remains unalterable
by 1c and by 1d. However, the matrix is clearly separable into two 2 × 2
submatrices. To obtain the degrees of freedom for the whole matrix, we
take the two separable parts individually. Each 2 × 2 matrix contributes
one degree of freedom, which, summed over the parts as step 2 requires,
brings the degree of freedom df_EF→E:F in the matrix as a whole to two.
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In the last example of Fi 
. structural iae. odis i ee μια ρος cm pos 
E ας i :H men of the seven possible cells is removable by τ L 
oxy a, ἡ ssi e, hence R consists of the seven cells. Then according 
it. i E p ds 76, dfc = dfy = dfy = 2 — 1Ξ 1, and dfgyy—G:u:3 = 7 - 
EE A AN e pais e model GH: J we find only one cell, cell 111, to 
^ iir s values, the other six remain unaltered by step |. 
id yields dfow = 6 - 1 = 5, donc 3.172, ιν ο ame 
rr =5-2-1=2. Applyingstep3, dígu-c:87 4-272. For the 
Weng GH:GJ:HJ, starting with GH, cell 111 is removed as before. 
ο with GJ, cells 211 and 221 are removed, leaving the 
aining four cells to be uniquely determinable by values in HJ, hence 
equi τν = 0. This suggests that the one structural zero here 

sae t ird-order interactions from the data. 

Alternative procedures are described by Bishop et al. (1978:115ff).


11. THE SIGNIFICANCE OF 
INFORMATION QUANTITIES 


Information quantities express differences between two distributions
of frequencies or probabilities. When the sample size of these
distributions is small, sampling biases may add to these differences. It
is known that information quantities obtained from a sample tend to
overestimate the true information quantities in a population from which
that sample was drawn and are rarely exactly zero even when there is no
statistical difference between the two distributions. This led Miller
(1955) to propose correction formulas that need not concern us here.
What we need to decide is whether an information quantity as measured
exceeds the sampling error that the sample size leads us to expect.

Miller and Madow (1954) have shown that quantities of information
and the familiar χ² (chi-square) values are similar in distribution. The
maximum likelihood estimate L²,

$$L^2 = 2n (\log_e 2)\, I = 1.3863\, n\, I$$

in which I is the information quantity in bits and n the sample size,
approximates χ² asymptotically, and the approximation becomes the
better the smaller the information quantities are. Thus with the help of
the L² values and the appropriate degrees of freedom, the probability
(significance level) that the information quantities reflect sampling
biases rather than true differences can be obtained using standard χ²
tables.

For example, in Chapter 4 and for data in Figure 13, we found the
amount of information retained between two interviews to be T(A:B) =
1.054 bits. The degrees of freedom in this 4 × 4 table is df_AB→A:B = 9.
And with n = 266, L² becomes 388.67. According to any χ² table, for 9
degrees of freedom we require a χ² of at least 27.88 in order to reject the
null hypothesis at the .001 level of significance. Given that L² here
exceeds the required χ² value, the information quantity is statistically
significant at this level. Or consider the data on racial biases in Figure 16
as analyzed in Figure 18. For the third school, T(3:B) = .187 bits, df = 3,
n = 272, L² = 70.51 exceeds the χ² = 16.27 required to reject the null
hypothesis at the .001 level of significance. Thus there is little doubt that
racial considerations matter in this school. For the fourth school,
T(4:B) = .00013 bits, df = 3, n = 140, L² = .03 does not come near the χ² of
any reasonable level of significance. Hence bias cannot be alleged here.
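The L² values quoted in these examples follow from the information quantities and sample sizes alone. A minimal Python sketch, with the function name `l_squared` chosen only for this illustration:

```python
from math import log

def l_squared(info_bits, n):
    """L^2 = 2 n ln(2) I for an information quantity I in bits (2 ln 2 is about 1.3863)."""
    return 2.0 * n * log(2.0) * info_bits

# The three cases discussed in the text: (information in bits, sample size).
l2_interviews = l_squared(1.054, 266)       # T(A:B), 9 degrees of freedom
l2_third_school = l_squared(0.187, 272)     # T(3:B), 3 degrees of freedom
l2_fourth_school = l_squared(0.00013, 140)  # T(4:B), 3 degrees of freedom

for label, value in [("interviews", l2_interviews),
                     ("third school", l2_third_school),
                     ("fourth school", l2_fourth_school)]:
    print(f"{label}: L2 = {value:.2f}")
```

Each result is then compared with the χ² value tabled for the appropriate degrees of freedom and the chosen level of significance.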

Note that χ² tests are limited to distributions whose average cell
frequency is at least five, n ≥ 5N. Although the violation of this restriction
biases the information quantities as well, as the emphasis in this test is on
the null hypothesis (of no true differences between two frequency
distributions), an overestimation of information quantities feeds the Type I
error of rejecting the null hypothesis when it should be accepted.
However, because of the particular organization of structural models
(see lattices in Figures 24 and 25), the models that do survive this test
tend to be more complex than actually needed and are likely to include
the true model (which would have been found had more data been
available) as one of their descendents. Unlike the χ² test, in the context of
this modeling approach, inadequate samples render the L² test not
inappropriate but merely more conservative. The L² test says little
about the complementary error of accepting positive quantities of
information as true quantities when they might be affected by
inadequate sample sizes. For further comparisons see Chapter 15.

For really small sample sizes (relative to the number of cells available
in a multivariate space) we refer to "bootstrap techniques" outlined by
Diaconis and Efron (1983), which provide reasonable estimates of the
reliabilities of the models inferred from data.


12. MAXIMUM ENTROPY COMPUTATIONS 


Structural models compute a probability distribution that satisfies
their parameters and is otherwise maximum in entropy. For models
without loops and without structural zeros, all relevant entropies can be
obtained algebraically from the entropies in a model's parameters
without having explicitly to generate maximum entropy probability
distributions. The latter is required for models with loops and for many
cases in which structural zeros are present. We will start with the former
and then proceed to the general case.


Models Without Loops and Without Structural Zeros


The computational shortcuts available for loopless models are rooted
in the fact that the probabilities such models generate are simple
products of the probabilities in a model's parameters whose
conditionality reflects the way components are connected. For example,
the maximum entropy probabilities in the model of independent
variables, A:B:C:...:Z, are

$$\pi_{abc \ldots z} = p_a\, p_b\, p_c \cdots p_z$$

In the chain AB:BC:...:YZ, similar to Figure 9, they are

$$\pi_{abc \ldots z} = p_{ab}\, p_{c|b}\, p_{d|c} \cdots p_{z|y} = \frac{p_{ab}\, p_{bc}\, p_{cd} \cdots p_{yz}}{p_b\, p_c \cdots p_y}$$

and in the model m_j = ABC:ACD:BCE of Figure 26 they are

$$\pi_{abcde} = p_{abc}\, p_{d|ac}\, p_{e|bc} = \frac{p_{abc}\, p_{acd}\, p_{bce}}{p_{ac}\, p_{bc}}$$

Given that the logarithm of a product equals the sum of the logarithms
of each part, the entropies of these probabilities become the sum of the
entropies in each component, conditional on the variables shared
among them:

H(A:B:C:...:Z) = H(A) + H(B) + H(C) + ... + H(Z)
H(AB:BC:...:YZ) = H(AB) + H_B(C) + ... + H_Y(Z)
H(ABC:ACD:BCE) = H(ABC) + H_AC(D) + H_BC(E)


Generalizing from the above, the maximum entropy in a model
without loops and without structural zeros is

$$H(K_1 : K_2 : K_3 : \ldots) = \sum_e H(K_e) - \sum_e \sum_{f>e} H(K_e \& K_f) + \sum_e \sum_{f>e} \sum_{g>f} H(K_e \& K_f \& K_g) - (\text{all entropies in variables shared among four components}) + \text{etc.} \qquad [12.1]$$


In words, it is the sum of the entropies in their components minus the sum 
of the entropies in variables shared by pairs of components, plus the sum 
of the entropies in variables shared by any three components, minus... 
and so on until no shared variables remain. So for ABC: ACD: BCE, 


H(ABC:ACD:BCE) = H(ABC) + H(ACD) + H(BCE)
                 − H(AC) − H(BC) − H(C)
                 + H(C)
               = H(ABC) + H_BC(E) + H_AC(D)
               = H(ACD) + H_AC(B) + H_BC(E)
               = H(BCE) + H_BC(A) + H_AC(D)

The first of these identities illustrates the entropy computation by 12.1; 
the last three reflect the orders in which components can be assembled 
sequentially. For comparing models without loops and without
structural zeros, information quantities can also be expressed as mere
entropy differences:


I(m_i→m_j) = H(m_j) − H(m_i) [12.2]
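For a small loopless case, 12.1 and 12.2 can be checked numerically against the product form of the probabilities. The Python sketch below uses the chain AB:BC over three binary variables; the cell weights are hypothetical, and the helper names `entropy` and `marginal` are assumptions of this illustration.

```python
from itertools import product
from math import log2

def entropy(dist):
    """Entropy in bits of a distribution given as {cell: probability}."""
    return -sum(v * log2(v) for v in dist.values() if v > 0)

def marginal(dist, axes):
    """Sum a distribution over all variables not listed in axes."""
    out = {}
    for cell, v in dist.items():
        key = tuple(cell[i] for i in axes)
        out[key] = out.get(key, 0.0) + v
    return out

# Hypothetical observed probabilities over binary A, B, C (weights sum to 36).
weights = [6, 2, 3, 5, 1, 7, 8, 4]
p = {cell: w / 36 for cell, w in zip(product((0, 1), repeat=3), weights)}
p_ab, p_bc, p_b = marginal(p, (0, 1)), marginal(p, (1, 2)), marginal(p, (1,))

# Maximum entropy probabilities of the chain AB:BC, i.e. p_ab * p_bc / p_b.
pi_chain = {(a, b, c): p_ab[a, b] * p_bc[b, c] / p_b[(b,)]
            for a, b, c in product((0, 1), repeat=3)}

h_direct = entropy(pi_chain)                               # from the distribution
h_by_12_1 = entropy(p_ab) + entropy(p_bc) - entropy(p_b)   # from 12.1
info_by_12_2 = h_by_12_1 - entropy(p)                      # I(ABC -> AB:BC) by 12.2
info_by_8_2 = sum(v * log2(v / pi_chain[cell]) for cell, v in p.items())
```

The two entropies agree, and so do the two information quantities, which is the shortcut the text describes: no distribution needs to be generated when the model is loopless.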


Although the transmission measures of information theory were
originally developed for models of independent variables and different
covers (McGill, 1954; Ashby, 1969), all of which are naturally loopless,
12.1 and 12.2 point to the possibility of finding T-measures for the
informational differences between descendent models. Consider the
models depicted in Figure 24 for examples. The closest common ancestor
of m_i and m_j has a loop and cannot be considered here, but the remaining
models in this figure are loopless. Illustrating the above, the amount of
information I(m_i→m_j) can be simplified using 12.1, 12.2, and 4.4, for
example:

 H(A:BD:CD:CE) = H(A) + H(BD) + H(CD) + H(CE)
                 − H(D) − H(C)

−H(ABD:CD:CE) = − H(ABD) − H(CD) − H(CE)
                 + H(D) + H(C)

I(ABD:CD:CE→A:BD:CD:CE) = H(A) + H(BD) − H(ABD)
                         = I(ABD→A:BD) = T(A:BD)

The algebraic properties of information in loopless models are further
developed in Chapter 13.


Models With Loops or With Structural Zeros 


The very nature of loops is that the components involved ultimately
lead back to themselves. Loops have neither beginning nor end. The
distribution of probabilities generated by models with loops must reflect
this crucial circularity. Entropies cannot be obtained by (closed form)
algebraic expressions that imply a linear order of computation. Take the
model AB:BC:AC for example. Applying the first component AB to the
observed probabilities p_a, we compute probabilities p_b and, applying the
second component BC to these, we find p_c. But then, applying the third
component AC, which closes the circle, we obtain values for p_a that may
not be the same as those with which we started, requiring revisions,
revisions of revisions, and so on. To take appropriate account of this
circularity, the computation has to proceed as indicated by such a model
and go around and around its loops until the distribution achieves
equilibrium (i.e., the computed probabilities no longer change from one
round to the next) and is maximum in entropy. The iterative algorithm
described below does just this.

It happens that structural zeros may make similar computational
demands. For example, in Figure 29 we find a matrix with structural
zeros in the diagonal. Had all cells of this matrix been occupiable, the
probabilities expected under the null hypothesis of independence would
have been π_ab = p_a p_b. However, this expression assigns non-zero
expectations also to the diagonal, yields an entropy that exceeds the
maximum obtainable within that matrix, and is thus misleading. Suppose,
then, we acknowledge the structural zeros and adjust the expected
probabilities in the non-zero cells by π_ab = p_a p_b/(1 − p_b). Although the
columns now add up, Σ_a π_ab = p_b, the rows do not, Σ_b π_ab ≠ p_a.
Attempting to adjust π_ab further so that also the row sums conform to the
required marginal values now disturbs the column sums, and so on, again
forcing the computation into a seemingly unending cycle of revisions
similar to models with loops.

Although there are several cases of matrices with structural zeros that
can be evaluated by conventional algebraic methods (see Bishop et al.,
1978, and the triangular matrix in Figure 29 for examples), the steps
involved are often so cumbersome that we suggest using the iterative
algorithm in all of these cases.

We state the iterative algorithm, originally proposed by Deming and
Stephan and generalized by Darroch and Ratcliff (1972), in these terms:

Given a model K_1:K_2:...:K_r with r components K_e.

Let p_abc... be the observed probabilities in the space ABC... the model
covers.

Let p_{k_e} be the probabilities in the e-th component K_e, obtained by
summing over the values \bar{k}_e of K_e's complement (variables not
in K_e):

$$p_{k_e} = \sum_{\bar{k}_e} p_{abc \ldots}$$

Let N_0 be the number of structural zeros, N_v be the number of fixed
probabilities, and let v be the sum of the fixed probabilities.

Set the N_0 cells with structural zeros to: $\omega^{(0)}_{abc \ldots} = 0$

Set the N_v cells with fixed probabilities to: $\omega^{(0)}_{abc \ldots} = p_{abc \ldots}$

Initialize the remainder of $\omega^{(0)}_{abc \ldots}$ to: $\omega^{(0)}_{abc \ldots} = \dfrac{1 - v}{N_{ABC \ldots} - N_0 - N_v}$

For iterations: t = 0, 1, 2, ...

For components: K_e, e = 1, 2, ..., r

For the N_ABC... − N_0 − N_v cells that are to be computed:

$$\omega^{(rt+e)}_{abc \ldots} = p_{k_e}\, \frac{\omega^{(rt+e-1)}_{abc \ldots}}{\sum_{\bar{k}_e} \omega^{(rt+e-1)}_{abc \ldots}}$$

Stop when a suitable level of approximation is reached.




In words, for a model AB:BC:AC in, say, a 2 × 4 × 3 space without
structural zeros, we start with the parameters of this model, the
probabilities p_ab, p_bc, and p_ac, obtained by summing over values in C,
in A, and in B, respectively. These marginal probabilities must be
matched by the distribution we seek to generate. As there are
2 · 4 · 3 = 24 cells and no fixed entries, we initialize ω^(0)_abc = 1/24 in
all cells.

We obtain the marginal probabilities in AB by:

$$\omega^{(3t)}_{ab} = \sum_c \omega^{(3t)}_{abc}$$

Considering that ω_ab should equal p_ab, we adjust:

$$\omega^{(3t+1)}_{abc} = p_{ab}\, \frac{\omega^{(3t)}_{abc}}{\omega^{(3t)}_{ab}}$$

We obtain the marginal probabilities in BC by:

$$\omega^{(3t+1)}_{bc} = \sum_a \omega^{(3t+1)}_{abc}$$

Considering that ω_bc should equal p_bc, we adjust:

$$\omega^{(3t+2)}_{abc} = p_{bc}\, \frac{\omega^{(3t+1)}_{abc}}{\omega^{(3t+1)}_{bc}}$$

We obtain the marginal probabilities in AC by:

$$\omega^{(3t+2)}_{ac} = \sum_b \omega^{(3t+2)}_{abc}$$

Considering that ω_ac should equal p_ac, we adjust:

$$\omega^{(3t+3)}_{abc} = p_{ac}\, \frac{\omega^{(3t+2)}_{abc}}{\omega^{(3t+2)}_{ac}}$$

We increment t by 1 and continue until the computed probabilities ω_ab
closely approximate the observed probabilities p_ab.
For models with simple loops and only a few structural zeros or fixed
probabilities, a reasonable approximation to the maximum entropy 
distribution is found in five to eight iterations, after which probabilities 
are generally accurate in the first three digits. When models contain 
neither loops nor structural zeros, the algorithm achieves a perfect fit 
after the first iteration. With the availability of computers, the iterative 
algorithm therefore may be used to compute the maximum entropy 
probabilities for all models. A test for having reached a reasonably close 
approximation then obviates the test for the presence of loops presented 
in Chapter 7. Although algebraic expressions are undoubtedly useful 
conceptually and advantageous computationally, the iterative algorithm 


is entirely general and limited only by the computability of the space
ABC...Z, specifically by its size N_ABC...Z.
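The iterative algorithm of this chapter can be sketched in Python for the simplest situation, a space without structural zeros or fixed probabilities. Everything specific here is an assumption of the illustration: the 2 × 2 × 2 cell weights are hypothetical, components are given as tuples of variable positions, and a fixed number of cycles stands in for a proper stopping test.

```python
from itertools import product

def marginal(dist, axes):
    """Sum a distribution over all variables not listed in axes."""
    out = {}
    for cell, v in dist.items():
        key = tuple(cell[i] for i in axes)
        out[key] = out.get(key, 0.0) + v
    return out

def iterative_fit(p, components, cycles=200):
    """Fit the marginals of p named in components, starting from a uniform distribution."""
    omega = {cell: 1.0 / len(p) for cell in p}      # initialization, no zeros
    for _ in range(cycles):
        for axes in components:                     # one adjustment per component
            target = marginal(p, axes)              # observed p in component K_e
            current = marginal(omega, axes)         # computed marginal in K_e
            for cell in omega:
                key = tuple(cell[i] for i in axes)
                if current[key] > 0:
                    omega[cell] *= target[key] / current[key]
    return omega

# Hypothetical observed probabilities over binary A, B, C (weights sum to 36).
weights = [6, 2, 3, 5, 1, 7, 8, 4]
p = {cell: w / 36 for cell, w in zip(product((0, 1), repeat=3), weights)}

loop_model = [(0, 1), (1, 2), (0, 2)]               # AB:BC:AC
omega = iterative_fit(p, loop_model)
```

Calling `iterative_fit(p, [(0, 1), (1, 2)], cycles=1)` for the loopless chain AB:BC reproduces p_ab p_bc / p_b after a single cycle, in line with the remark that models without loops and structural zeros are fitted after the first iteration.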


13. CONFIRMATION 


In a confirmatory mode of analysis, we start with a structural model,
then test how well that model fits the given data, and, finally, we analyze
the details of the model's fit to direct the interpretation of findings. In
contrast, in an exploratory mode, we specify at most the properties of
the class of models to be considered and leave the search for an optimum
model (in a sense to be delineated) to a procedure, usually in the form of
a computer algorithm. The confirmatory approach is particularly
appropriate when the analysis is guided by theoretical considerations,
for example, when the validity of a particular theory is at stake or when
some patterns of explanation are preferable to others.

We will elaborate here several analytical devices applicable to struc- 
tural models generally and leave the search algorithms for Chapter 14. 


The Goodness of Fit of a Model 


The goodness of fit reduces to testing the significance of the difference
between the original data in the saturated model m_0 and the distribution
generated by a model m_j that conforms to the data only in its parameters
and is maximum in entropy otherwise (see Figure 19). According to 8.4,
in which m_0 is represented by the observed probabilities p_abc...z and m_j
by the generated probabilities π_abc...z, the amount of information by
which the model is in error is

$$I(m_0 \to m_j) = \sum_{abc \ldots} p_{abc \ldots} \log_2 \frac{p_{abc \ldots}}{\pi_{abc \ldots}}$$

A zero value of this quantity indicates a perfect fit. Non-zero values are
tested for their significance by

$$L^2_{m_0 \to m_j} = 1.3863\, n\, I(m_0 \to m_j)$$

in which n is the sample size. Together with the appropriate degrees of
freedom df_m_0→m_j, as discussed in Chapter 10, and an ordinary χ² table,
the significance level is determined as in Chapter 11. In this test the
model serves as a structural null hypothesis. If the quantity by
which the model is in error is significant, then the model must be rejected
as inadequate. If this quantity is insignificant, the model may be accepted
as an explanation of the data, its significance level indicating the
probability of being wrong in this decision.


The Amount of Information Modeled


This assesses the difference between the distribution generated by the
model in hand and the distribution that would be expected if all of the
model's variables were unrelated or independent. With the observed
probabilities p_abc...z in m_0, the probabilities ω_abc...z generated by m_j,
and the probabilities π_abc...z associated with the model m_ind, the
amount of information modeled is

$$I(m_j \to m_{ind}) = \sum_{abc \ldots} p_{abc \ldots} \log_2 \frac{\omega_{abc \ldots}}{\pi_{abc \ldots}}$$

For a model to be a reasonably good one we expect this quantity to be
large. If it is not, a significance test in which m_ind serves as the structural
null hypothesis will reveal whether the modeling effort has merit at all.

The two information quantities are related by 8.5:

I(m_0→m_ind) = I(m_0→m_j) + I(m_j→m_ind)

which suggests two instructive expressions, indicating the proportion of
information a model explains or fails to account for, respectively:

Proportion of unexplained information: I(m_0→m_j)/I(m_0→m_ind)
Proportion of explained information: I(m_j→m_ind)/I(m_0→m_ind)
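The two proportions are simple ratios once 8.5 supplies the total. The Python sketch below uses the two information quantities reported for the television and aggression example that follows; the function name is an assumption of this illustration.

```python
def explained_proportions(info_in_error, info_modeled):
    """Return (unexplained, explained) proportions of the total information."""
    total = info_in_error + info_modeled    # I(m_0 -> m_ind), by 8.5
    return info_in_error / total, info_modeled / total

# I(m_0 -> m_j) = .0080 bits and I(m_j -> m_ind) = .6229 bits.
unexplained, explained = explained_proportions(0.0080, 0.6229)
print(f"unexplained = {unexplained:.2%}, explained = {explained:.2%}")
```

The proportions describe the model as a whole; they say nothing about which of its components carries the explained information.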


For example, the model m_j = EAE':EAA' for the television and
aggression data in Figure 5, also depicted in Figure 35, is evaluated as
follows:


Information Ignored              Information Modeled
I(m_0→m_j) = .0080 bits          I(m_j→m_ind) = .6229 bits
L² = 3.35                        L² = 262.53
df = 4                           df = 7
significance = no                significance = .0001 level
unexplained = 1.27%              explained = 98.73%


These rather unambiguous findings suggest that it would be a mistake to 
dismiss the model relating prior TV violence exposure, E, and prior 
aggressive behavior, A, to subsequent TV violence exposure, E’, on the 
one hand and to subsequent aggressive behavior, A’, on the other. 
However, the confidence this test establishes refers only to the absence 
of modeling errors. Neither the test nor any of the measures employed 
will indicate whether the model is structurally the most economical one, 
an issue discussed in Chapter 14. Moreover, these measures apply only 


to a model as a whole and are not indicative of its individual parts, to 
which we will now turn. 


The Complexity of a Model’s Components 


This should indicate how much is involved in describing or in
building a model's components. Whether a component operationalizes
a verbal hypothesis or embodies a complex function, in order to con-
tribute to the modeling effort that component must recognize or draw a
finite number of distinctions. The larger this number is, the more variety
it can store, transmit, or share and the more difficult it will therefore be
to describe or materially represent that component. The required
number of cells or states ranges between the following extremes:

    \lceil 2^{H(K)} \rceil \le \text{required number of states or cells} \le N_K        [13.1]

where N_K is the number of occupiable cells in K, excluding structural
zeros, H(K) is the entropy in K, and the inverted brackets denote that the
enclosed expression is to be rounded to the nearest larger integer. log2
of the three expressions would give the complexity of a component
in bits.
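As an illustration of 13.1, the sketch below computes both extremes from a list of observed states of a component. It assumes that N_K may be approximated by the number of cells actually occupied in the sample (structural zeros cannot be told apart from sampling zeros here); the function name and the sample are hypothetical.

```python
import math
from collections import Counter


def state_bounds(observations):
    """Return H(K) in bits and the two extremes of form 13.1."""
    counts = Counter(observations)
    n = sum(counts.values())
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    # Round before taking the ceiling so that floating-point noise in
    # 2 ** entropy cannot push an exact integer up to the next one.
    lower = math.ceil(round(2 ** entropy, 9))
    upper = len(counts)  # assumption: occupied cells stand in for N_K
    return entropy, lower, upper


# A hypothetical, very uneven three-state component.
entropy, lower, upper = state_bounds(["a"] * 14 + ["b", "c"])
print(round(entropy, 4), lower, upper)
```

Taking log2 of `lower` and `upper` gives the complexity bounds in bits, with H(K) itself as the unrounded minimum.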


The Contributions a Component Makes 
... the component in question. All of these contributions are
assessed by removing interactions from a component and measuring
the informational difference this makes.
The interactions that need to be removed from a component K are
defined by the partition K_part of this component's variables into sets of
variables that always cooccur in the model. For example, in Figure 33,
AB and CD occur together in K_1:K_2:K_3 = ABCDE:CDEF:EFG.
Partitioning K_1 by this rule yields K_1part = AB:CD:E in which AB are
unique to K_1, and partitioning K_3 yields K_3part = E:F:G in which
G is unique to K_3. The latter omits the interactions <EF>, <EFG>,
<EG>, and <FG>, of which <EF> is shared by K_2 and K_3, the other three
being unique to K_3.
The total amount of information processed by a component is the
informational difference between the whole and the partitioned com-
ponent. This measure makes no reference to the context of that component.
Given that the partitioned component consists of independent variables, we
can express this quantity in two ways:

    I(K → K_part) = T(K_part)        [13.2]

In the preceding example, K_1 processes I(ABCDE → AB:CD:E) =
T(AB:CD:E) bits.
The unique contribution by a component is the informational dif-
ference between the model m that contains the component K whole and
the model m,K_part that contains K_part in K's place:

    I(m → m,K_part)        [13.3]

Because all but K's unique variables are shared with other
components in the model m and are hence redundant in m,K_part, the model
m,K_part represents K by its unique variables only and is simply omitted if
no such variables exist. The component K_1 of the model in Figure 33
contains the unique variable AB, whereas K_2 contains none. Their
unique contributions are, respectively,

    I(m → m,K_1part) = I(ABCDE:CDEF:EFG → AB:CDEF:EFG)
    I(m → m,K_2part) = I(ABCDE:CDEF:EFG → ABCDE:EFG)

In a component's unique contribution, the variables shared with the
other components of that model are controlled, averaged over all of their
values and thereby prevented from entering that measure, whereas in the
total amount of information processed by a component, variables are
not controlled and may hence contain shared, or what is sometimes
called spurious, quantities.

    ABCDE : CDEF : EFG
    Figure 33
The amount of information shared between one component K and all
other components of the model m is the difference between K's total
quantity and its unique contribution:

    I(K → K_part) − I(m → m,K_part)        [13.4]
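The decomposition of a component's total into unique and shared quantities can be checked numerically for K_1 = ABCDE of the model in Figure 33, whose unique contribution reduces to T(AB:CDE) because the model has no loops. The sketch below uses an arbitrary randomly generated distribution over five binary variables; the helper names and the data are our assumptions, not part of the text.

```python
import itertools
import math
import random


def entropy(joint, keep):
    """Entropy in bits of the marginal over the variable positions in keep."""
    marginal = {}
    for cell, p in joint.items():
        key = tuple(cell[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)


def transmission(joint, *parts):
    """T(X:Y:...) = sum of the parts' entropies minus their joint entropy."""
    joined = tuple(i for part in parts for i in part)
    return sum(entropy(joint, part) for part in parts) - entropy(joint, joined)


random.seed(0)  # hypothetical data: any distribution over A..E would do
cells = list(itertools.product((0, 1), repeat=5))
weights = [random.random() for _ in cells]
joint = {c: w / sum(weights) for c, w in zip(cells, weights)}

AB, CD, E = (0, 1), (2, 3), (4,)            # positions of A, B, C, D, E
total = transmission(joint, AB, CD, E)      # T(AB:CD:E), form 13.2
unique = transmission(joint, AB, CD + E)    # T(AB:CDE), unique to K_1
shared = total - unique                     # form 13.4
print(round(shared, 6), round(transmission(joint, CD, E), 6))
```

The two printed numbers coincide: the quantity K_1 shares with the rest of the model is T(CD:E).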


In models without loops, these contributions can be expressed by
T-measures. Figure 34 differentiates the contributions for the model in
Figure 33. For example, the amount of information T(E:F) that
responds to the interaction <EF> is a shared quantity ... Only when
shared quantities are absent do the amounts processed
by individual components add to the total amount
a model processes. This sum is uninterpretable otherwise.


The Strength of Relations (Association) 


The strength of relations within a model's components follows from
the above. Association is strongest when (a) all variables attached to a
component are perfectly predictable (determined) from each other and
(b) the component realizing this relation is essential in the context of all
other components of that model. The bivariate index of predictability
in the form of 4.9 responds to condition a and is a qualitative
analogue of the path coefficient or the squared correlation coefficient.
We propose to generalize this index to any structure and hence to numbers
of variables larger than two, thus making it responsive to condition b.

The amount of a component's unique contribution is sensitive to the
context of the model in which that component participates and serves
as an absolute measure of association among its variables. The upper
limit of this quantity is found with 8.9, and the proportion of the two
quantities indicates the extent to which a component's behavior is
predictable or determinate (as opposed to governed by random pro-
cesses). We propose the following relative measure of association as a
generalization of 4.9:

    0 \le t_{K,m} = \frac{I(m \to m,K_{part})}{I(m \to m,K_{part})_{max}} \le 1        [13.5]

                                        K_1             K_2                 K_3
    Model:                              ABCDE     :     CDEF          :     EFG
    Unique contributions
      I(m → m,K_part)                   T(AB:CDE)       T(CDE:EF)           T(EF:G)
    Shared contributions
      I(K → K_part) − I(m → m,K_part)   T(CD:E)         T(CD:E) + T(E:F)    T(E:F)
    Total
      I(K → K_part)                     T(AB:CD:E)      T(CD:E:F)           T(E:F:G)

    Figure 34


t_K,m (read the subscript as "component K in the context of model m")
indicates the strength of the associations within K. These associations
are unique to K, not shared with other components of the model m. The
index is zero when the variables separated in K_part are all independent in
K, in which case K is a totally fictitious component of m and may be
omitted without loss. The index is unity when the variables in K_part are,
within the confines of the model's parameters, maximally constrained, in
which case K embodies a many-to-one if not a one-to-one-to-one ...
relation (the qualitative analogue of perfect "multicollinearity" and the
multivariate version of perfect correlation).
Applied to the television and aggression data in Figure 5, the unique
contribution of the component EAE' of the model EAE':EAA' in Figure
35 that attempts to explain TV exposure to violence is

    I(m → m,K_part) = I(EAE':EAA' → E':EAA') = .2032

With L² = 85.6 and df = 3 this contribution is significant at the .0001 level.
Its maximum is obtained by 8.9:

    I(m → m,K_part)_max = I(m → m,K_part) + H(EAE') − max[H(E'), H(EA)]
                        = .2032 + 2.7922 − max[.9992, 1.9962]
                        = .9992
Thus the association coefficient becomes t_EAE',EAE':EAA' = .2032/.9992 =
.2034. It suggests that the association in EAE', though statistically
significant, has only 20% of the strength it could have within the context
of this model. This value is found in Figure 35, which also depicts the
model EE':EA':AA' for comparison.
According to the latter model in this figure, exposure to violent
TV programming and aggressive behavior turns out to be remarkably
stable over time, with television exerting only a small influence on
aggressive behavior. In assessing such associations it is important to
note that the choice of a model is crucial because it specifies the controls
to which measures of a component's strength respond. Associations like
those discussed could be spurious, an issue we will now address.
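The arithmetic behind this coefficient can be retraced from the three entropies quoted in the example. The sketch assumes, as the reported numbers suggest, that the unique contribution of EAE' equals H(E') + H(EA) − H(EAE') and that its maximum is obtained as in the computation attributed to 8.9; the function name is ours.

```python
def association(h_joint, h_parts):
    """Return the unique contribution, its maximum, and their ratio t."""
    contribution = sum(h_parts) - h_joint             # here T(E':EA)
    maximum = contribution + h_joint - max(h_parts)   # as in the example for 8.9
    return contribution, maximum, contribution / maximum


# H(EAE') = 2.7922, H(E') = .9992, H(EA) = 1.9962 as quoted in the text.
contribution, maximum, t = association(2.7922, [0.9992, 1.9962])
print(round(contribution, 4), round(maximum, 4), round(t, 4))
```

The result agrees with the reported values .2032, .9992, and .2034. For a bipartition the maximum is simply the smaller of the two entropies.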


The Amount of Interaction 


This is the informational difference between two models that differ 
only by the interaction to be assessed, one being the immediate descen- 
dent of the other. For example, the model ABC: ABD: CD includes the 
interaction <CD>, whereas the model ABC: ABD does not, the latter 


being an immediate descendent of the former. The model ABC:ABD
includes the interaction <ABC>, whereas AC:BC:ABD does not, both
being exactly one generation apart. With this understanding the amount
of interaction is merely notational. Let m_{<\bar{K}>} be a model that excludes
the interaction <K> and let m_{<K>} be the immediate ancestor of
m_{<\bar{K}>}, now including <K>. The amount of information associated
with the interaction <K> then becomes

    I(m_{<K>} \to m_{<\bar{K}>})        [13.6]


With reference to the model m_0, this quantity is called the genuine (as
opposed to the spurious) amount of interaction for it expresses the
amount of information in interaction <K> with all variables not in K
controlled for, averaged, or prevented from contributing to this
measure. References to models other than m_0 omit some of these
controls. Figure 36 shows genuine interactions of different ordinality
and their controls within five variables.

    Figure 35 (the models EAE':EAA' and EE':EA':AA' for the television
    and aggression data)
Continuing the example of the television and aggression data, the
extent of causal relations among television exposure, aggression, and
subsequent aggression (that is, the amount of genuine interaction
in <EAA'>) is

    I(EAE':EAA':EE'A':AE'A' → EAE':EE'A':AE'A') = .0008

and is not significant. The extent to which exposure to television
causes aggression (that is, the amount of genuine interaction
in <EA'>) is

    I(EAE':EA':AE'A' → EAE':AE'A') = .0058

It is larger than the triple interaction but still not significant. However,
the stability of aggressive behavior (that is, the amount of genuine
interaction in <AA'>) is

    I(EAE':AA':EE'A' → EAE':EE'A') = .3869

with L² = 163.1, df = 1, and is significant at the .0001 level. This
interaction cannot be dismissed. Thus interaction effects can be isolated
and measured with the strongest controls to which given data lend
themselves. As we said, weaker controls are possible in the context of
models other than m_0.


Strata Within Models 


Structural models may be examined from yet another perspective.
One can ignore some of the variables of a multivariate dis-
tribution by summing and then test models with smaller covers on the
whole sample. But one can also leave the dimensionality of the data
intact and examine how well a model fits within a particular subsample
of these data. When such subsamples are defined in a model's own
terms, one evaluates strata within a space.

    Ordinality  <K>       Model with <K>              Model without <K>           Controlled
        5       <ABCDE>   ABCDE = m_0                 ABCD:ABCE:ABDE:ACDE:BCDE    none
        4       <ABCD>    ABCD:ABCE:ABDE:ACDE:BCDE    ABCE:ABDE:ACDE:BCDE         E
        3       <ABC>     ABC:ABDE:ACDE:BCDE          ABDE:ACDE:BCDE              DE
        2       <AB>      AB:ACDE:BCDE                ACDE:BCDE                   CDE

    Figure 36

For example, in a model that suggests a certain variable to be the 
input or the controlling variable of the modeled process, one may want 
to test the extent to which the behavior of the model associated with 
each input value conforms to or deviates from the structure summarily 
represented by the model of all data. If the input to such a model is an 
on-off switch, causing a structure in the on position and independence in
the alternative, the model would presumably be confirmed in the on
position only; data on the off position then merely add noise. Or
consider a complex model of social mobility, including categories of 
religious affiliations. It is quite possible that each religious group 
conforms to a variation of the general model and that a separate 
assessment of their conformity to the descendents of this general model 
might provide additional insights about the differences among these 
groups. 

Information measures aiding the examination of strata take advan-
tage of the fact that information quantities are averages of log-likelihood
ratios as in 4.7 and that such averages may also be obtained for any
subspace of a multivariate space characterized by particular values in its
variables or in the parameters of a model. We generalize the informa-
tional bias 5.2 to complex models:


    I(m_i \to m_j) = \sum_{s} p_s \, I_s(m_i \to m_j)        [13.7]

and

    I_s(m_i \to m_j) = \frac{1}{p_s} \sum_{abc \ldots z \in s} p_{abc \ldots z} \log_2 \frac{\pi^{(i)}_{abc \ldots z}}{\pi^{(j)}_{abc \ldots z}}        [13.8]

where s denotes one category or value of a subspace S of ABC...Z,
the probabilities p_s, the sum of the p_abc...z in s, are observed, the
probabilities π^(i)_abc...z are generated by m_i, and the probabilities
π^(j)_abc...z are generated by m_j. Equation 13.7 partitions an information
quantity into the weighted sum of the information quantities associated
with each stratum s, and 13.8 shows the latter as the average
log-likelihood ratio within that stratum.
We follow the notational conventions of 5.2 according to which the
stratum, indicated by subscript in the above, replaces refer-
ences to variables in the model designations, now held constant:
    I_a(AB → A:B) = I(aB → a:B) = T(a:B)
                  = \frac{1}{p_a} \sum_{b} p_{ab} \log_2 \frac{p_{ab}}{\pi_{ab}}

In this case the degrees of freedom is not df_AB→A:B but df_B. Using
the models ABC:BD:CD and AB:C:D as examples, we can examine any slice
(or plane) in the multivariate space by holding one value constant, for
example, in the unique variable A:

    I_a(ABC:BD:CD → AB:C:D) = I(aBC:BD:CD → aB:C:D)

for which the degree of freedom becomes df_BC:BD:CD→B:C:D. We can
examine any parameter of a model (subspace or cylinder of the multi-
variate space) by holding one of its values constant, for example, in the
second component:

    I_bd(ABC:BD:CD → AB:C:D) = I(AbC:bd:Cd → Ab:C:d)
                             = \frac{1}{p_{bd}} \sum_{ac} p_{abcd} \log_2 \frac{\pi^{(i)}_{abcd}}{\pi^{(j)}_{abcd}}

for which the degree of freedom reduces to df_AC→A:C. And in the extreme
case, we can examine any state of the model (a cell in the multivariate
space):

    I_abcd(ABC:BD:CD → AB:C:D) = I(abc:bd:cd → ab:c:d)

losing all degrees of freedom, however. The latter is no longer an
information measure proper but will identify deviant cells. We exempli-
fied its use in Figure 15.
The reduction in the number of degrees of freedom points to the fact
that strata are not capable of recognizing the highest-order interaction
in the original data. Tests on strata of models are particularly useful in
conjunction with forms like 8.5, which partitions the total amount of
information into the amount omitted and represented by that model.
Equation 13.7 points to the possibility of aggregating the individual
strata similar to the informational bias in Figure 17.
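The partition in 13.7 and 13.8 can be illustrated with the simplest case treated above, the strata a of the pair of models AB and A:B. The sketch below uses a hypothetical 2 × 2 table of observed probabilities; the function names are ours. It confirms that the stratum quantities, weighted by p_s, add to the overall quantity.

```python
import math


def information(p, pi_i, pi_j, cells=None):
    """Sum of p * log2(pi_i / pi_j) over the given cells (all cells by default)."""
    cells = list(p) if cells is None else cells
    return sum(p[c] * math.log2(pi_i[c] / pi_j[c]) for c in cells if p[c] > 0)


def stratum_information(p, pi_i, pi_j, in_stratum):
    """Form 13.8: the average log-likelihood ratio within one stratum."""
    cells = [c for c in p if in_stratum(c)]
    p_s = sum(p[c] for c in cells)
    return p_s, information(p, pi_i, pi_j, cells) / p_s


# Hypothetical observed probabilities p_ab; m_i = AB reproduces them exactly.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_a = {a: sum(v for (x, _), v in p.items() if x == a) for a in (0, 1)}
p_b = {b: sum(v for (_, y), v in p.items() if y == b) for b in (0, 1)}
independent = {(a, b): p_a[a] * p_b[b] for (a, b) in p}   # m_j = A:B

total = information(p, p, independent)                    # I(AB -> A:B) = T(A:B)
strata = [stratum_information(p, p, independent, lambda c, a=a: c[0] == a)
          for a in (0, 1)]
weighted = sum(p_s * i_s for p_s, i_s in strata)          # right side of 13.7
print(round(total, 6), round(weighted, 6))
```

A single stratum quantity can be negative even though the weighted sum cannot; a stratum whose quantity departs markedly from the others deserves separate attention.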


14. EXPLORATION 


In an exploratory mode of analysis we search among models that
possess certain specified structural properties to find those with an
optimal balance between the two conflicting criteria of simplicity and
goodness of fit. Inasmuch as the search presupposes little about the models
under consideration, exploration may lead to unanticipated results. We
will illustrate the process by means of several algorithms.


Searching for the Ordinalities of Appropriate Models 
Data may vary greatly in complexity. Appropriate techniques of
analysis must have the capacity to respond to their complexity or
potentially important patterns may never be discovered. Here we search
for the ordinality of the interactions manifest in multivariate data.
As a digression, we note that most of the familiar statistical techniques
in the social sciences respond to binary relations only and are then of
ordinality two: correlations between ..., distances between ...,
comparisons between two systems, and so on. Such techniques are ...
reasons to suspect that higher-order interactions ... An appropriate
measure is the informational difference between the original data and the
model consisting of all possible binary components AB:AC:AD:...:YZ.
It assesses the interaction of ordinality larger than two ...

Using the superscript w to denote the common ordinality of all
components of a model, starting with m^W = m_0, where W is the number
of variables in m_0 and also the largest ordinality these data can
contain, and ending with the model consisting of W independent
variables, m^1 = m_ind, we partition the total amount of information in
the data I(m^W → m^1) = I(m_0 → m_ind) by
    I(m^W → m^(W−1))     = amount of W-th order interaction        [14.1]
    I(m^(W−1) → m^(W−2)) = amount of (W−1)-th order interaction
      ...
    I(m^w → m^(w−1))     = amount of w-th order interaction
      ...
    I(m^3 → m^2)         = amount of third-order interaction
    I(m^2 → m^1)         = amount of second-order interaction

For example, for the five variables A, B, C, E, and F (omitting D) of data
in Figure 28, we obtain the following account:

    I(m^5 → m^4) =  .0000
    I(m^4 → m^3) =  .9780
    I(m^3 → m^2) =  .0000
    I(m^2 → m^1) =  .0743
    ----------------------
    I(m^5 → m^1) = 1.0523 bits

where:

    m^5 = ABCEF
    m^4 = ABCE:ABCF:ABEF:ACEF:BCEF
    m^3 = ABC:ABE:ABF:ACE:ACF:AEF:BCE:BCF:BEF:CEF
    m^2 = AB:AC:AE:AF:BC:BE:BF:CE:CF:EF
    m^1 = A:B:C:E:F
Here interactions of ordinality two amount to only .0743 bits of
the total amount of information in the data. Interactions of ordinality
four measure .9780 bits and account for the remainder. Ternary
and quintenary interactions are absent. The account shows that an
analysis of the data in terms of pairs of variables would miss an
important pattern and that an appropriate analytical technique must
respond to pattern of an ordinality of at least four. Thus the procedure
locates the ordinality of the interactional content in given data and makes
it possible to determine the requirements of appropriate analytical
techniques.

Note that all models with components of uniform ordinality larger
than unity contain loops and require iterative procedures and hence
electronic computers for their evaluation. Note further that the number
of components of an ordinality of w, W!/w!(W − w)!, increases with the
number W of variables covered by the model, and is largest
when w = W/2. These numbers can easily exceed computational limits
and must be kept small in practical applications. Because analytical
techniques of lower ordinality are more readily available and easier to
apply, we suggest evaluating such models in the order of their increasing
ordinality (that is, first m^1, then m^2, then m^3, and so on) until either
practical analytical procedures of that ordinality are no longer available
or satisfactory amounts of information are accounted for. Exceeding
the former criterion suggests that the data are too complex to be
analyzed; reaching the latter condition indicates the ordinality an appro-
priate technique would require. For other heuristics see Conant (1981).
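The models m^w of uniform ordinality, and the number of components each contains, are easily generated. The sketch below does only this bookkeeping; evaluating the models requires the iterative procedures mentioned above and is not attempted. The function name is ours.

```python
from itertools import combinations
from math import comb


def uniform_model(variables, w):
    """Model m^w in colon notation: every component of exactly w variables."""
    return ":".join("".join(c) for c in combinations(variables, w))


for w in range(5, 0, -1):
    print(f"m^{w} =", uniform_model("ABCEF", w))

# Number of components W!/w!(W - w)! for W = 7 variables; it peaks near w = W/2.
print([comb(7, w) for w in range(1, 8)])
```

For seven variables the counts are 7, 21, 35, 35, 21, 7, 1, which illustrates why the numbers of components must be kept small in practice.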


Searching for Optimum Models 


Here we describe a general search algorithm for data explorations and
in turn develop three variations of this procedure. The steps in this
general algorithm are

(1) Start with some model m_i (this model could be the saturated
    model m_0 containing all complexities in the data, the model m^w as
    obtained from the previous procedure, or any model suggested in
    theoretical writings, for example).

(2) Compute the next generation of descendents m_j of the model m_i
    that conforms to the desired characteristics of the models to be
    explored. (The implementation of the algorithm varies with these
    characteristics.)

(3) For each descendent model m_j compute I(m_i → m_j), I(m_0 → m_j),
    their statistical significance, or whatever may serve as a termina-
    tion criterion for the search process.

(4) Unless a termination criterion is reached, enter the (set of)
    model(s) m_j for which I(m_i → m_j) is smallest as the next ancestor(s)
    m_i into step 2 above. The most obvious termination criterion is
    that the quantity I(m_0 → m_j) of information omitted is statistically
    significant. Another criterion is that the quantity I(m_0 → m_j)
    exceeds a certain proportion of the total I(m_0 → m_ind), and so on.

... no longer justifiable. Now we turn to three implemen-
tations of this search.
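The four steps above can be sketched as a short loop. The sketch is deliberately generic: how descendents are generated, how the omitted information is measured, and when to stop are passed in as functions, all of whose names are ours. Unlike step 4 it follows a single best descendent rather than a set of tied ones.

```python
def search(start, descendents, omitted, acceptable):
    """Follow, generation by generation, the descendent losing the least information."""
    path, current = [start], start                  # step 1
    while True:
        candidates = descendents(current)           # step 2
        if not candidates:
            return path
        # steps 3 and 4: the descendent m_j with the smallest I(m_i -> m_j)
        best = min(candidates, key=lambda m: omitted(current, m))
        if not acceptable(best):                    # termination criterion
            return path
        path.append(best)
        current = best


# Toy stand-ins: a "model" is just an integer measure of its complexity.
toy_path = search(
    6,
    descendents=lambda n: [n - 1, n - 2] if n > 2 else [],
    omitted=lambda ancestor, m: ancestor - m,
    acceptable=lambda m: m >= 3,
)
print(toy_path)
```

In an actual application `omitted` would return I(m_i → m_j) and `acceptable` would test whether I(m_0 → m_j) is still statistically insignificant.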
All models are considered possible except that each model
must cover all variables selected for analysis; no variable is
excluded. The following algorithm, in place of step 2, generates
immediate descendents:

    Given a model m_i = K_1:K_2:...:K_r of r components K_f.
    Let the component K_f have w variables V', V'', ..., V^w.
    For each K_f, f = 1, 2, ..., r, for which w > 1, generate an immediate
    descendent as follows:
    (a) Replace K_f by the string K_f − V':K_f − V'':...:K_f − V^w in which
        each K − V omits a different variable, thus removing the
        interaction <K_f> from K_f.
    (b) Remove any K − V that is now redundant relative to the
        remainder of m_i and enter the result as an immediate descen-
        dent m_j.
    The resulting set of models m_j is the set of immediate descendents of
    the model m_i.

For example, m_i = ABC:CD. Decomposing the first component
yields AB:AC:BC:CD without redundancies. Decomposing the second
component yields ABC:C:D in which C is redundant. Hence ABC:CD's
two immediate descendents are AB:AC:BC:CD and ABC:D. The
former omits the interaction <ABC>, the latter omits the interac-
tion <CD>.
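Steps (a) and (b) translate almost literally into set operations. In the sketch below a model is a set of components and a component a set of variables; a component is treated as redundant when it is wholly contained in another. The helper names `parse`, `show`, and `simplify` are ours.

```python
def parse(text):
    """Read colon notation such as 'ABC:CD' into a set of components."""
    return frozenset(frozenset(part) for part in text.split(":"))


def show(model):
    """Write a model back in colon notation, alphabetically ordered."""
    return ":".join(sorted("".join(sorted(c)) for c in model))


def simplify(components):
    """Step (b): drop every component contained in another component."""
    comps = set(components)
    return frozenset(c for c in comps if not any(c < d for d in comps))


def immediate_descendents(model):
    result = set()
    for k in model:
        if len(k) < 2:
            continue                      # only components with w > 1
        parts = {k - {v} for v in k}      # step (a): each part omits one variable
        result.add(simplify((model - {k}) | parts))
    return result


for m in sorted(show(m) for m in immediate_descendents(parse("ABC:CD"))):
    print(m)
```

Applied to ABC:CD this prints AB:AC:BC:CD and ABC:D, the two descendents derived in the example.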
When this algorithm for generating immediate descendents of a
model is entered in step 2 of the general search algorithm and applied to
the election data in Figure 3, the search proceeds as shown in Figure 37.
Here A and A' are party affiliations (R = Republican, D = Democrat)
obtained during the first and second interview, respectively, and P and
P' are preferences (+, −) for Willkie expressed at the same two times. The
models with the smallest error in any one generation are starred (asterisk)
in the righthand column of Figure 37 and taken to be the ancestor(s) of
the models generated at the subsequent step. Following the sequence(s) of
models with the smallest errors, we see that the fourth-order interac-
tion <APA'P'> is removed in step 1, all third-order interactions
are then removed in steps 2 through 5, and three second-order inter-
actions, <AP'>, <PA'>, and <AP>, are removed in steps 6 through 8,
yielding the model AA':PP':A'P' as the simplest model with still insig-
nificant errors. In all further simplifications errors become significant,
and the search may have to stop there. The remaining components generate
an artificial distribution that closely approximates the original data.
Figure 38 depicts the observed and model-generated frequency
distributions and above it the three parameters of that model.

    Figure 37 (search among models covering all variables, election data;
    columns: Step, Model m_j, I(m_i → m_j), Sig., I(m_0 → m_j), Sig.,
    Next Ancestor)

    Components K_f:    AA'                  A'P'                 PP'

                        A'                   P'                   P'
                      R     D              +     −              +     −
               A  R  166     4      A'  R  142    27      P  +  143    16
                  D    3    93          D   15    82         −   14    93

    Data m_0: APA'P' (observed frequencies)
    Model m_j: AA':A'P':PP' (model-generated frequencies in parentheses)
    Figure 38

The substantive conclusion ... that voting is marked ...
second and independent of this ... effect of polarizing the population;
hence A'P' represents more information ... model, whereas AP does not.
Higher-order explanations are unwarranted here.

Note that the above search for an optimum model started with m_0 =
APA'P' and made ... For example, had one first determined where significant
ordinalities are located, one would have found that interactions of
ordinality three and four are insignificant in this case (see m^3 at step 1
and m^2 at step 5 in Figure 37) and starting with m^2 would have yielded
the same result with half the computational effort. Or, starting with a
model that represents certain theoretical propositions, one could have
tested whether simplifications of this model are empirically justi-
fiable. In this example the models produced in steps beyond 8 are no
longer justifiable in any case.

The algorithm for generating models covering all variables is one of
four including algorithms for generating models that are selective about
variables and models without loops (Krippendorff, 1982a).

Regression models distinguish between two kinds of variables:
criteria or dependent variables and predictor or independent variables.
Predictor variables are intended to explain the criterion variables, and
structures within the latter are explored only in reference to this aim.
Here we consider one criterion variable only. The algorithm for gener-
ating such models, now taking the place of step 2 in the general search
procedure, is as follows:


    Given any regression model m_i = ZL_1:ZL_2:...:ZL_r:L_0,
    where Z is the criterion variable, L_0 contains all predictor
    variables, and L_1, L_2, ... are contained in L_0.

    The two extreme cases of such models are the saturated model m_0 =
    ZL_0, which includes all relations in the data, and Z:L_0, which
    excludes all relations between the two kinds of variables.

    Let L_f have w variables V', V'', ..., V^w.

    For each ZL_f of m_i, f = 1, 2, ..., r, taking one at a time:

    (a) Replace ZL_f by the string L_f − V':L_f − V'':...:L_f − V^w in which
        each L − V omits a different variable, thus removing the interaction
        <ZL_f> from ZL_f. If w = 1, L_f − V omits that one variable.

    (b) With each L_f − V resulting from (a) associate the criterion vari-
        able Z.

    (c) Remove any Z(L_f − V) that is now redundant relative to the
        remainder of m_i and enter the result as a next-generation regres-
        sion model m_j.

    The resulting set of regression models m_j is the set of next-generation
    descendents of m_i.
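Steps (a) through (c) can be sketched with the same set representation used for models covering all variables. Here a regression model is represented only by its predictor sets L_f; the criterion Z and the component L_0 are implicit in every model, and an empty predictor set stands for Z alone, as in Z:L_0. The function name and this representation are our assumptions.

```python
def regression_descendents(predictor_sets):
    """Next-generation regression models, one per component Z L_f."""
    predictor_sets = frozenset(predictor_sets)
    result = set()
    for lf in predictor_sets:
        if not lf:
            continue                         # Z:L_0 has no further descendents
        parts = {lf - {v} for v in lf}       # steps (a) and (b), Z kept implicit
        candidate = (predictor_sets - {lf}) | parts
        # step (c): drop any Z(L_f - V) contained in another Z L component
        result.add(frozenset(c for c in candidate
                             if not any(c < d for d in candidate)))
    return result


saturated = {frozenset({"E", "A", "E'"})}          # m_0 = Z L_0 with Z = A'
first = regression_descendents(saturated)
print(first)

simplest = {frozenset({"E"}), frozenset({"A"})}    # the model A'E:A'A:EAE'
print(regression_descendents(simplest))
```

The first generation consists of the single model A'EA:A'EE':A'AE':EAE', and the descendents of A'E:A'A:EAE' are A'A:EAE' and A'E:EAE', each dropping one of the two binary relations to the criterion.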

We exemplify the steps involved with the television and aggression
data used previously. Aggressive behavior A' is taken as the criterion
variable Z, and the three variables E, A, and E' are taken as the predictor
variables for A'. The steps are shown in Figure 39. Here L_0 = EAE' is seen
as a separate component in each of these models. It accounts for
the relations among the predictor variables unrelated to A'. The
remaining components contain interactions involving A' and subsets of
the variables E, A, and E'. The simplest model with the least amount of
error turns out to be A'E:A'A:EAE' and is found in step 5. It relates
television violence and aggressive behavior both separately to subse-
quent aggressive behavior. Step 6 shows that the A'E component, relating
television violence to subsequent aggression, is the weakest and its
omission would lead to a barely significant error (.05 level), a finding
already depicted in Figure 35.

    Figure 39 (search among regression models, television and aggression
    data; columns: Step, Model m_j, I(m_i → m_j), Sig., I(m_0 → m_j), Sig.,
    Next Ancestor)

The above algorithm is the simplest one of several other forms for
regression analyses (Krippendorff, 1982b) that could focus on different
kinds of contributions (cumulative, ordinal, unique, the above contribu-
tions being additive), on multiple criterion variables, or on situations in
which two or more classes of variables are considered predictors of each
other.

Partition models. Partitioning aims at grouping variables into mutu-
ally exclusive subsets of minimal statistical dependence. By identifying
subsets that are nearly independent, partitioning may point to variables
that can be described separately, without loss or with minimal losses and
at a substantial reduction in analytical efforts. Partitioning also yields
results consistent with the notion of a hierarchy of part-whole distinc-
tions or of subsystem-system relations and serves the common problem
of understanding a whole by its "natural" parts. We describe the
algorithm as follows:

    Given any model m_i whose components K partition an initial set of
    variables into mutually exclusive parts. Initially m_i may be the satu-
    rated model m_0 and ultimately it becomes m_ind.

    On each component K of m_i that contains more than one variable,
    apply the general search algorithm with the algorithm for generating
    all "models" on the path toward an additional bipartition in place of
    step 2 and replace K by L:M resulting from K's partition.

    The algorithm for generating all "models" on the path toward one
    bipartition initially accepts the component K = L:M, proceeds
    through "models" LS:SM, where S denotes variables shared by the
    two components and L and M are unique, and terminates with L:M
    whose parts are mutually exclusive and jointly cover K's original
    variables.

    Let V' and V'' denote two variables in S and let S − V say that variable
    V is removed from S.
    Initially: for each pair of variables V' and V'' in K replace K by
        (K − V'):(K − V'').
        All resulting forms LS:SM are initial descendents of K
        on the path toward its eventual bipartition.
    Otherwise: for each variable V of S
        replace LS:SM by LS:(S − V)M and by L(S − V):SM.
        The resulting forms are subsequent descendents of K on
        the path toward its eventual bipartition.
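The two rules can be sketched by representing each "model" LS:SM as a triple (L, S, M) of variable sets. Because LS:SM and MS:SL denote the same model, the sketch orders the two unique parts canonically; this ordering and the function names are our assumptions.

```python
from itertools import combinations


def canon(l, s, m):
    """Order the two unique parts so that mirror-image forms coincide."""
    first, second = sorted((l, m), key=sorted)
    return (first, s, second)


def initial_forms(k):
    """'Initially': replace K by (K - V'):(K - V'') for each pair V', V''."""
    k = frozenset(k)
    return {canon(frozenset({b}), k - {a, b}, frozenset({a}))
            for a, b in combinations(sorted(k), 2)}


def next_forms(form):
    """'Otherwise': move one shared variable V wholly to one of the two sides."""
    l, s, m = form
    result = set()
    for v in s:
        result.add(canon(l | {v}, s - {v}, m))   # LS:(S - V)M, V now unique to L
        result.add(canon(l, s - {v}, m | {v}))   # L(S - V):SM, V now unique to M
    return result


forms = initial_forms("ABCDEFG")
print(len(forms))                                # initial descendents of K

frontier = forms
while any(s for _, s, _ in frontier):            # until nothing is shared
    frontier = set().union(*(next_forms(f) for f in frontier))
print(sorted({tuple(sorted((len(l), len(m)))) for l, _, m in frontier}))
```

For seven variables there are 21 initial forms, and every path ends in a bipartition of 6:1, 5:2, or 4:3 variables, as stated for Figure 40. A search would of course follow only the least costly form at each step rather than enumerate all of them.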


The lattice in Figure 40 contains all possible structures of models on 
the path toward one bipartition of seven variables. The number of
descendent models that need to be evaluated to determine the next step
is indicated therein. All paths terminate in any one of three kinds of
bipartitions, numerically with 6:1, 5:2, and 4:3 variables. The quanti- 
tative criteria guiding this search are as in the previous searches. One 
considerable advantage of this algorithm is that none of the models 
involved contain loops and can thus be evaluated more efficiently than 
those that do. 
Figure 41 depicts the results of reapplying the algorithm to each part
of a partition of seven variables, ultimately achieving the complete
decomposition into separate variables. The lattice does not show the
models that are intermediate to these partitions. For example, the whole
lattice in Figure 40 is summarized as the first step of the process shown in
Figure 41. The informational difference between the original data and
the first bipartition expresses the interdependence between the two
principal parts of that partition. Subsequent differences express addi-
tional interdependencies between the two parts of a finer partition, thus
revealing a hierarchical account of interdependencies. Such differences
often are interpreted as measures of communication between the two
subsystems in the context of a (sub)system to which both belong. The
sum of these quantities along any one path in this lattice equals the total
amount of communication within the whole system, and this total is
invariant to the order by which the partition was obtained.

    Figure 41

Again, there are variations to the partitioning algorithm. For
example, one may take not absolute but relative information measures
(as in 4.9) as decision criteria, thus favoring partitions whose parts
are similar in size. One may combine the algorithm for generating
bipartitions with the one for generating regression models and achieve
partitions among predictor variables, and so on.

Beyond the three kinds of implementations of the general search
algorithm, note that any class of models whose structural properties can
be formally stated and incorporated in a process of generating
descendent models can be subjected to incremental simplifications.
Researchers may want to define their own problem of exploration in these
terms and follow the algorithm outlined above.

A point of caution is needed here. The number of models that can be
defined and must be evaluated during explorations can become large
even when only moderate numbers of variables are involved. The
stepwise and incremental approach followed by the search procedure
reduces the computational effort considerably. But even here
computational limits are approached rather quickly. This author's computer
program for confirmation handles up to 10 variables with no more than
10 values each and up to 10 components. Conant (1981) has been
working on a computationally more efficient approach (which cannot
be presented here).


Algebraic Techniques 

These techniques go back to work done by Ashby (1965, 1969) and 
Conant (1976) and are known to apply only to models without loops 
(Krippendorff, 1980) and without structural zeros. When models do 
contain loops, iterative procedures are required, as discussed in Chapter 
12. Algebraic techniques have the advantage of computational
efficiency and lead to simple conceptualizations. We extend here some of
the forms introduced in Chapters 8 and 12.

The most simple identity between entropies and the two kinds of
expressions for amounts of information is


I(m_0 → m_j) = H(m_j) - H(m_0) = T(m_j)   [14.2]


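It does not hold for models with loops (8.3). We apply this identity
to any two models that are loopless and cover the same variables:


I(m_i → m_j) = H(m_j) - H(m_i) = T(X_1) + T(X_2) + ... + T(X_k) + ...   [14.3]


where X_k = K_k&K'_1 : K_k&K'_2 : K_k&K'_3 : ... with redundant parts
eliminated, K_k is the kth component of m_i, K'_1, K'_2, K'_3, ... are
the components of m_j, and T(K) = 0 for a single component K.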


We also introduce a notational simplification by entering variables
shared by all components of a model as subscripts:


T(LK_1:LK_2:LK_3:...) = T_L(K_1:K_2:K_3:...)   [14.4]


which identifies variables in L as the controlling variables of
K_1:K_2:K_3:.... For example, for m_i = ABCD:BCDE → m_j = ABC:BCD:CE


X_1 = ABCD&ABC : ABCD&BCD : ABCD&CE = ABC:BCD
X_2 = BCDE&ABC : BCDE&BCD : BCDE&CE = BCD:CE


and


I(m_i → m_j) = H(m_j) - H(m_i) = T(ABC:BCD) + T(BCD:CE)
             = T_BC(A:D) + T_C(BD:E)


wherein the two T-measures assess communication between A and D
and between BD and E, both of which are present in m_i but absent from
m_j. Equation 8.8 exemplified the application of 14.3 to chains. In
both cases T-measures cover different variables. Equation 14.3
states a fundamental relationship between the I-measures, which have
descendent models of the same covers in their arguments, and
T-measures, which express dependencies between the variables involved in the
differences between these models.
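
The example above can be checked numerically. The sketch below assumes a hypothetical joint distribution over five binary variables (random weights, for illustration only) and uses the standard closed form for the entropy of a loopless model: the sum of its component entropies minus the entropies of the variables shared between adjacent components. That closed form is an assumption of this sketch, not a formula quoted from the text; the helper names are hypothetical.

```python
import itertools
import math
import random

VARS = "ABCDE"


def random_joint(seed=0):
    """Hypothetical joint distribution over five binary variables."""
    rng = random.Random(seed)
    cells = list(itertools.product((0, 1), repeat=len(VARS)))
    weights = [rng.random() for _ in cells]
    total = sum(weights)
    return {cell: w / total for cell, w in zip(cells, weights)}


def H(p, names):
    """Entropy in bits of the marginal distribution over the named variables."""
    idx = [VARS.index(v) for v in names]
    marg = {}
    for cell, prob in p.items():
        key = tuple(cell[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)


def T_cond(p, ctrl, left, right):
    """Conditional transmission T_ctrl(left:right)."""
    return (H(p, ctrl + left) + H(p, ctrl + right)
            - H(p, ctrl) - H(p, ctrl + left + right))


p = random_joint()

# Entropies of the two loopless models (assumed closed form, see lead-in).
H_mi = H(p, "ABCD") + H(p, "BCDE") - H(p, "BCD")                       # ABCD:BCDE
H_mj = H(p, "ABC") + H(p, "BCD") + H(p, "CE") - H(p, "BC") - H(p, "C")  # ABC:BCD:CE

lhs = H_mj - H_mi                                             # I(m_i -> m_j)
rhs = T_cond(p, "BC", "A", "D") + T_cond(p, "C", "BD", "E")  # T_BC(A:D) + T_C(BD:E)

if __name__ == "__main__":
    print(round(lhs, 6), round(rhs, 6))
```

Both sides agree for any joint distribution, because the identity is algebraic; the random data only make the agreement visible.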

Algebraic techniques for exploration essentially decompose a total
amount of information into additive quantities. These additive
quantities collectively designate one or more paths through a lattice of
loopless models (also see Figure 43), summarize the information losses of
several intermediate models (without actually evaluating them explicitly),
and thus help the researcher to find paths along which strong structures
exist. Figure 42 shows the T-measures along the two paths between
ABCD:BCDE and ABC:BCD:CE. If a quantity is insignificant, the
intermediate models are not worth exploring and may be ignored in the
search for an optimum. If one or more of these quantities is statistically
significant, a systematic top-down search for the optimum can start with
the simplest descendent model implicit in the insignificant T-measures.
In the example, if T_C(BD:E) is and T_BC(A:D) is not significant, then one
would initiate a search with the model ABC:BCDE.

We now state three identities and derive a fourth, all of which may be
used for algebraic exploration in a lattice of loopless models. The first is
the extension to more variables:


T(A:B:C:D:...) = T(A:B) + T(AB:C) + T(ABC:D) + ...   [14.5]


It equates the total amount of information in data T(m_ind) with a series of
binary transmission terms, each covering one variable more than the
preceding one. The order of variables being arbitrary, numerous
enumeration schemes are possible. In the three-variable case,


T(A:B:C) = T(A:B) + T(AB: C) 
= T(A:C) + T(AC: B) 
= T(B:C) + T(A: BC) 


Each identity evaluates at least one path in the lattice of loopless models
between ABC and A:B:C, which bypasses all models with loops, and in
this case also the models AB:BC, AB:AC, and AC:BC. (See Figure 25
for the complete lattice.)
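
The three enumeration schemes of the three-variable case can be verified with a short computation. The following is a sketch on a hypothetical joint distribution of three binary variables (random weights, illustration only); the helper names H and T are assumptions of the sketch, with T(part1:part2:...) taken as the sum of the part entropies minus the joint entropy.

```python
import itertools
import math
import random


def random_joint(names, seed=1):
    """Hypothetical joint distribution over binary variables (illustration only)."""
    rng = random.Random(seed)
    cells = list(itertools.product((0, 1), repeat=len(names)))
    weights = [rng.random() for _ in cells]
    total = sum(weights)
    return names, {cell: w / total for cell, w in zip(cells, weights)}


def H(dist, part):
    """Entropy in bits of the marginal over the variables named in `part`."""
    names, p = dist
    idx = [names.index(v) for v in part]
    marg = {}
    for cell, prob in p.items():
        key = tuple(cell[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)


def T(dist, *parts):
    """Transmission T(part1:part2:...): part entropies minus joint entropy."""
    return sum(H(dist, part) for part in parts) - H(dist, "".join(parts))


d = random_joint("ABC")
total = T(d, "A", "B", "C")
orders = [
    T(d, "A", "B") + T(d, "AB", "C"),
    T(d, "A", "C") + T(d, "AC", "B"),
    T(d, "B", "C") + T(d, "A", "BC"),
]

if __name__ == "__main__":
    print(round(total, 6), [round(x, 6) for x in orders])
```

All three sums reproduce T(A:B:C); only the way the total is apportioned between the two terms depends on the order chosen.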
Figure 42. The two paths from m_i = ABCD:BCDE to m_j = ABC:BCD:CE,
one through ABC:BCDE and one through ABCD:CE, with steps T_BC(A:D) and
T_C(BD:E); I(m_i → m_j) = H(m_j) - H(m_i)
= H(ABC:BCD:CE) - H(ABCD:BCDE)

The second identity concerns partitions into mutually exclusive
parts:


T(A:B:C:...:L:M:N:...) = T(A:B:C:...) + T(L:M:N:...)
                       + T(ABC...:LMN...)   [14.6]


It suggests that the total T(m_ind) can be broken down into the sum of the
amounts of information within each part plus the amount of
information between these parts. This identity underlies the search algorithm
for partitions, and the path(s) such an identity evaluates can be envisioned
by means of Figure 41.
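
A small numerical sketch may help to see how this partition identity, applied repeatedly, yields a hierarchical account whose total does not depend on the order of partitioning. The data are a hypothetical joint distribution over four binary variables, and the helper names are assumptions of the sketch.

```python
import itertools
import math
import random


def random_joint(names, seed=2):
    """Hypothetical joint distribution over binary variables (illustration only)."""
    rng = random.Random(seed)
    cells = list(itertools.product((0, 1), repeat=len(names)))
    weights = [rng.random() for _ in cells]
    total = sum(weights)
    return names, {cell: w / total for cell, w in zip(cells, weights)}


def H(dist, part):
    """Entropy in bits of the marginal over the variables named in `part`."""
    names, p = dist
    idx = [names.index(v) for v in part]
    marg = {}
    for cell, prob in p.items():
        key = tuple(cell[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)


def T(dist, *parts):
    """Transmission between the listed parts."""
    return sum(H(dist, part) for part in parts) - H(dist, "".join(parts))


d = random_joint("ABCD")
whole = T(d, "A", "B", "C", "D")

# Path 1: first split AB from CD, then split each part.
path_1 = T(d, "AB", "CD") + T(d, "A", "B") + T(d, "C", "D")
# Path 2: first split A from BCD, then B from CD, then C from D.
path_2 = T(d, "A", "BCD") + T(d, "B", "CD") + T(d, "C", "D")

if __name__ == "__main__":
    print(round(whole, 6), round(path_1, 6), round(path_2, 6))
```

In path 1 the first term is the information between the two parts and the remaining terms are the amounts within each part; path 2 reaches the same total through a different sequence of bipartitions.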


The third identity is related to the regression of one variable in terms 
of all others: 


T(ABC...:Z) = T(A:Z) + T_A(B:Z) + T_AB(C:Z) + ... + T_ABC...(Y:Z)   [14.7]


It expresses the total amount of information between one variable and
all others as a function of the information between that one variable and
one other, that one variable and a second other controlled for by the
first, and so on. Applying 4.5 to the left side and bringing the
conditional entropy H_ABC...(Z) in T(ABC...:Z) to the other side yields


H(Z) = T(A:Z) + T_A(B:Z) + T_AB(C:Z) + ... + T_ABC...(Y:Z)
     + H_ABC...(Z)   [14.8]


where H(Z) is the entropy in the variable Z to be explained, H_ABC...(Z) is
the unexplainable entropy or noise in Z, and T terms are the incremental
contributions to H(Z). A stepwise regression procedure naturally
follows from this equation. It would start by searching for a variable Y
for which T_ABC...(Y:Z) is minimum, then search for a variable X for
which T_ABC...(X:Z) is [...] ([...] advisable given that higher-order
interactions would escape this measure, whereas the conditional
measures use them as controls.)
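
The accounting of H(Z) into incremental contributions plus noise can be illustrated as follows. This sketch assumes three hypothetical binary predictors A, B, C and a binary criterion Z with a random joint distribution; the helper names are assumptions, and the last lines show only the first step of the stepwise procedure (finding the predictor whose conditional contribution is smallest).

```python
import itertools
import math
import random


def random_joint(names, seed=3):
    """Hypothetical joint distribution over binary variables (illustration only)."""
    rng = random.Random(seed)
    cells = list(itertools.product((0, 1), repeat=len(names)))
    weights = [rng.random() for _ in cells]
    total = sum(weights)
    return names, {cell: w / total for cell, w in zip(cells, weights)}


def H(dist, part):
    """Entropy in bits of the marginal over the variables named in `part`."""
    names, p = dist
    idx = [names.index(v) for v in part]
    marg = {}
    for cell, prob in p.items():
        key = tuple(cell[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)


def T_cond(dist, ctrl, left, right):
    """Conditional transmission T_ctrl(left:right); ctrl may be empty."""
    return (H(dist, ctrl + left) + H(dist, ctrl + right)
            - H(dist, ctrl) - H(dist, ctrl + left + right))


d = random_joint("ABCZ")

# Incremental contributions: T(A:Z), T_A(B:Z), T_AB(C:Z).
terms = [
    T_cond(d, "", "A", "Z"),
    T_cond(d, "A", "B", "Z"),
    T_cond(d, "AB", "C", "Z"),
]
noise = H(d, "ABCZ") - H(d, "ABC")  # conditional entropy H_ABC(Z)
explained_plus_noise = sum(terms) + noise

# First step of the stepwise procedure: the predictor whose contribution,
# controlled for all other predictors, is minimum.
predictors = "ABC"
weakest = min(
    predictors,
    key=lambda y: T_cond(d, predictors.replace(y, ""), y, "Z"),
)

if __name__ == "__main__":
    print(round(H(d, "Z"), 6), round(explained_plus_noise, 6), weakest)
```

Whether the minimum contribution is small enough to drop the variable is a matter for the significance tests discussed earlier, which this sketch does not perform.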
Applying now the identity of regression 14.7 to each of the parts
obtained from the identity of extension 14.5 gives the following account:


T(A:B) + T(AB:C) + T(ABC:D) + T(ABCD:E) + ...

  = T(A:B) + T(A:C) + T(A:D) + T(A:E) + ...   [14.9]
           + T_A(B:C) + T_A(B:D) + T_A(B:E) + ...
           + T_AB(C:D) + T_AB(C:E) + ...
           + T_ABC(D:E) + ...
           + ...

  = T(A:B:C:D:E:...)


Each of these terms represents an informational difference between two
descendent loopless models that are one generation apart. These
quantities partition the total amount of information into the incremental
information losses along some path through the lattice of possible
loopless models from m_0 to m_ind. Again, variables can be taken in any
order, and 14.9 can be used to evaluate any path through this lattice.
Moreover, and inasmuch as some of these terms can be rearranged and
applied to different models, a given set of terms may be shared by several
such paths. Figure 43 depicts the 16 possible paths for which the terms in
14.9 can account. For simplicity, this figure represents the same T terms
by the same kind of line and at the same angle, thus showing the different
positions these terms may occupy along these paths. Finding, for
example, that T_A(B:C) and T(A:B) are significant whereas all others are
not would suggest that the optimum model lies somewhere between
ABC:D and AC:B:D, from both of which D could be ignored as a
noncontributory variable.

Figure 43

Even though algebraic techniques are restricted to loopless models,
they do provide useful tools for exploring the complexity of multivariate
data and become essential outside the computational limits for
evaluating models with loops.


15. COMPARISONS WITH 
ALTERNATIVE APPROACHES 


Network and Path Analyses 


The information theory for the structural modeling of qualitative
data has much in common with network analysis, path analysis, and the
structural modeling approach to quantitative data: All respond to the
need to make relations in multivariate data transparent. Network
analysis, for example, largely starts with bivariate data, such as who
talks to whom and how often, distances in space, or differences in time
or in other magnitudes; aggregates such data much as graph theory
does; and then identifies chains, loops, bottlenecks, centralities, and so
on but is unable to consider relations of ordinality higher than two.
Higher-order causes or consequences sometimes enter the path
diagrams as complicating phenomena, but because its arrows link variables
in pairs, the approach is basically focused on bivariate explanations of
multivariate phenomena. (Try drawing a line connecting three points
other than in pairs!) The coefficients of structural equation models,
briefly discussed in Chapter 6, do not need to but often do express linear
relations between pairs of variables and are thus similarly limiting. The
information theory approach is not so restricted, however. It considers
binary relations merely [...] test theories [...] especially [...]
unlimited complexity.

Also mentioned in Chapter [...] theoretical models do not [...]
other mathematical idealizations,
the parameters of information-theoretical models are the very
distributions found in the multivariate data themselves, without any
simplification. It follows that any kind of relation, linear/nonlinear,
unimodal/multimodal, deterministic/probabilistic, and so on, is
preserved in the distributions our models generate. The information
theoretical approach is hence entirely general.


Chi-Square 

Chapter 11 explored the role of the maximum likelihood
approximation L² in providing information theory with access to the familiar χ²
tables. All three quantities


χ² = n Σ (p - π)² / π

I  = Σ p log2 (p / π)

L² = 2n Σ p log_e (p / π) = 1.3863 n I


are zero when the observed probabilities p equal the expected
probabilities π and increase in magnitude with increasing differences
between the two. χ² and L² are functions of the sample size n; the
information I is independent of it.
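
The relation among the three quantities can be made concrete with a short computation. The counts below are hypothetical (a 2 x 2 table and the frequencies expected under independence with the same margins), and the function name is an assumption of this sketch.

```python
import math


def fit_statistics(observed, expected):
    """Return (chi-square, I in bits, L-square) for matching lists of counts."""
    n = sum(observed)
    p = [o / n for o in observed]               # observed probabilities
    pi = [e / sum(expected) for e in expected]  # expected probabilities
    chi2 = n * sum((a - b) ** 2 / b for a, b in zip(p, pi))
    info = sum(a * math.log2(a / b) for a, b in zip(p, pi) if a > 0)
    l2 = 2 * n * sum(a * math.log(a / b) for a, b in zip(p, pi) if a > 0)
    return chi2, info, l2


# Hypothetical 2 x 2 table [[30, 10], [20, 40]], listed row by row, and the
# frequencies expected under independence with the same margins.
observed = [30, 10, 20, 40]
expected = [20, 20, 30, 30]

chi2, info, l2 = fit_statistics(observed, expected)

# The same proportions in a sample twice as large.
chi2_big, info_big, l2_big = fit_statistics(
    [2 * o for o in observed], [2 * e for e in expected]
)

if __name__ == "__main__":
    print(chi2, info, l2)
    print(chi2_big, info_big, l2_big)
```

Doubling the sample doubles χ² and L² but leaves I unchanged, and L² equals 2 log_e 2 (about 1.3863) times n times I in both cases.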

Despite these functional similarities, a major difference is that I and
L², the latter being a mere multiple of I, are additive in ways χ² is not. In
particular, differences in χ² values are uninterpretable, whereas many
differences among information quantities yield other information
quantities that can be subjected to the same tests as the quantities from which
they were derived. Thus as Ku and Kullback (1974) and many others
have concluded, information quantities provide researchers with an
analytical flexibility unknown to χ² users, and information statistics are
therefore often preferable to χ² statistics.

Another difference lies in the magnitude individual cells contribute to
the two measures. Cell contributions to χ² are known to become
unjustifiably large when expected frequencies (which appear in its
denominator) are small or nearly absent. This is the primary reason why the use
of χ² statistics calls for a minimum of five expected observations per cell
(strict condition) or an average of five or more observations per cell
(weak condition), which, in the context of multivariate analysis, often
makes exorbitant demands on adequate sample sizes. In contrast, the
log-likelihood contributions to information quantities are weighted by
the observed frequencies in a cell, and each contribution is therefore
proportional to this frequency (see Figure 15 for examples).

Although information measures become biased as well when samples 
are inadequate, they tend to overestimate the true quantities involved. 
In significance tests, such biases make the rejection of null hypotheses 
(that observed and expected probability distributions really are the 
same) more likely when they are true in fact and thus favor models that 
are more complex than necessary and include the correct model as one 
of their descendents. Chapter 11 concluded that in the structural 
modeling context inadequate sample sizes render information statistics 
not inappropriate but merely more conservative. 


Analysis of Variance 


The similarity of entropy as a measure of diversity or variety and 
variance was explored in Chapter 3, which suggested that entropy, 
implying no assumptions regarding the order or shape of the distribu- 
tion in data, is the more general of the two. This argument is further 
strengthened by the fact that information quantities for interval data or 
"continuous channels" have been proposed in the original work by 
Shannon and Weaver (1949), whereas a converse proposal for applying 
the analysis of variance to qualitative data or "discrete channels" is 
unavailable. Despite these differences there are interesting similarities
that stem from their respective logic of partitioning variation. In the
analysis of variance


V_tot = V^A + V^B + V^AB + V^C + V^AC + V^BC + V^ABC   [15.1]


where each effect is defined independent of all others; for example, V^AB
excludes what V^A and V^B contribute and is in turn excluded from the
contribution by V^ABC, thus accounting for the unique effect of the
interaction <AB> on the criterion variable, say Z. For W predictor
variables 15.1 has 2^W terms.

A partition of information quantities resembling 15.1 is found in
14.7, wherein the total amount of information in Z, T(ABC...:Z),
corresponds to V_tot, T(A:Z) corresponds to V^A, T_A(B:Z) corresponds to
V^B + V^AB, T_AB(C:Z) to V^C + V^AC + V^BC + V^ABC, and so on. Evidently
both forms are capable of accounting for the variation in one variable by
partitioning others. However, the information identity 14.7 accounts for
these components in groups. The reason for this property will become
clear in the following comparison.


Log-Linear Modeling


Log-linear modeling (Goodman, 1972; Bishop et al., 1978) is related
to information theory as well. It proposes an additive function similar to
15.1 for explaining not the variation in one variable but the frequencies
in the very multivariate space it partitions:


log n_abc... = u + u^A_a + u^B_b + u^AB_ab + u^C_c + u^AC_ac
             + u^BC_bc + u^ABC_abc + ...   [15.2]


where u is the average log n_abc... over all cells, u^A_a expresses the
deviation from this average on account of A, u^AB_ab expresses the
deviation due to AB over and above what u, u^A_a, and u^B_b express, and
so on. All u terms are logarithms of various forms of cross-product ratios,
and the form of this function obviously resembles that used in the analysis
of variance except that it applies to individual cells.

For two reasons the ideal of 15.2 is unachievable. First, u terms are
not entirely independent. Already u^ABC_abc is no longer obtainable by
algebraic means because it would have to exclude the effects of u^AB_ab,
u^AC_ac, and u^BC_bc, which, taken together, constitute a loop and must be
evaluated iteratively. The sum of these parts does not equal the whole,
which challenges the function's additivity. Second, to obtain expected
frequencies, u terms cannot be zeroed arbitrarily. Bishop et al.'s "hierarchy
principle" (1978: 67-68) formalizes the order in which contributions may
or must be grouped, thus curbing the freedom to assemble the u terms
into models that the function's notations claim.

The information theory for structural modeling provides an additive
form as well:


I(m_0 → m_ind) = I(m_0 → m_1) + I(m_1 → m_2) + ... + I(... → m_ind)   [15.3]


It partitions the total amount of information into up to 2^W - W - 1
additive quantities, representing contributions similar to 15.1 and 15.2
except for those associated with the W single variables (which appear
moved here to the left side of the expression and are contained in
I(m_0 → m_ind)) and the overall term, which is zero. If models are immediate
descendents, then each informational difference measures the
contribution of exactly one interaction, just as in the analysis of variance and
intended by the log-linear ideal, but it measures these always in the
context of the model from which it is removed, thus implying that these
contributions are ordered. The notation m_0 → m_1 → m_2 → ... → m_ind
indicates a descendence ordering of those contributions and designates one
path through the lattice of possible models. There are therefore up to as
many additive functions of the form 15.3 available as there are paths
through such a lattice.

On the surface, the absence of a single accounting equation and the
context sensitivity of the additive quantities involved might seem
disadvantageous. However, this merely recognizes Bishop et al.'s hierarchy
principle. Whereas the log-linear approach postulates the ideal of a
single additive function of all possible contributions and must then tell
users that its terms cannot be analyzed (assembled or ignored) freely, the
information-theoretical approach has built the same restrictions into its
logic of structural models (see Chapter 6) and into its accounting
equations that implicitly abide by this logic. With this in mind, the
informational account, besides offering a summary account for all cells, is
simpler than the log-linear approach. I surmise this difference to be also
one of style: The log-linear approach grew out of the traditions of
analysis. The information-theoretical approach grew out of an iterative
exploration of data.


The Most Basic Reference Possible 


Finally, throughout the book we stated all informational accounts with
reference to models of the same cover. This places m_ind at the base of the
lattices of models considered here and defines I(m_0 → m_ind) = T(m_ind) as
the maximum amount of information such models can explain. It
disallows the simple omission of variables and prevents an accounting of
the contributions such omitted variables make. For practically all
structural modeling tasks this reference is sufficient and we chose it for this
very reason. However, nothing prevents an extension of the information
quantities to models with different covers that claim no knowledge
about the distribution in some (or all) variables (Krippendorff, 1981)
and to state individual contributions in terms of 2^W logarithmic
functions of frequencies analogous to 15.2.


log2 n_abc... = log2 n/(N_A N_B N_C ...)
              + log2 p_a/(1/N_A) + log2 p_b/(1/N_B) + log2 p_ab/(p_a p_b)
              + log2 p_c/(1/N_C) + log2 p_ac/(p_a p_c)
              + log2 (π_abc p_a)/(p_ab p_ac) + log2 p_abc/π_abc + ...   [15.4]


Here n is the sample size, N_A is the number of occupiable cells in A, and
π is the probability generated by the model AB:AC:BC (which
contains a loop). log2 n/(N_A N_B N_C ...) resembles the general u term, and all
other parts of 15.4 are log-likelihood ratios whose non-zero values
indicate the magnitude and direction of an effect on log2 n_abc.... Figure 44
exemplifies this account.

Averaging the parts in 15.4 and moving the log2 n/(N_A N_B N_C ...) term to
the left side of the identity yields 2^W - 1 information quantities. In the
three-variable case 15.4 becomes


T(ABC) = T(A) + T(B) + T(A:B) + T(C) + T(A:C)
       + [T_A(B:C) - T(AB:AC:BC)] + T(AB:AC:BC)   [15.5]


where the redundancy T(ABC) is the informational difference between
the presence and the total absence of any knowledge about a probability
distribution in selected variables, T(ABC...) = I(m_0 → m_max) generalizes
4.10, and the other T measures are as usual. An analysis of the Florida
murder trial data in Figure 1 demonstrates the use of these forms. Figure
44 shows all models along the optimum path (as defined by the search
algorithm in Chapter 14), the frequency distributions in their
parameters, and the maximum entropy frequencies generated by them jointly
(rounded to full integers). The meaning of m_max as the ultimate
descendent may also become clear in the latter distribution. The race of the
murderer is A, the race of the victim is V, and the outcome of the trial is
O, as shown in Figure 4.

The figure also illustrates the use of 15.4 to explain the cell of 48 cases
in which the murder victims are white and the perpetrators of the crime
are black and sentenced to death. Its eight terms sum to log2 48 = 5.5850.
Knowledge of <O>, that death penalty is rare in comparison to other
outcomes, and of <AV>, that interracial violence is less frequent than
intraracial violence, accounts for the largest deviations and indicates
that observed frequencies are less than those assumed by the uniform
distribution. Knowledge of <VO>, that death penalty is more likely
when victims are white, and of <AO>, that murderers are more likely
sentenced to death when they are black, accounts for increases in
frequency in this cell. A third-order interaction is absent.

The figure also contains a complete account according to 15.5 of the
total amount of redundancy T(AVO) = I(m_0 → m_max) = 1.4421 bits. Only
the third-order quantity turns out to be zero. All descendents of the
model AV:AO:VO exhibit statistically significant errors and cannot be
accepted as adequate models of these data. Because this simple 2 × 2 × 2
example has few degrees of freedom, 15.4 and 15.5 yield essentially
similar insights. This may not be so when variables involve finer
distinctions and cells make very different contributions.

In conclusion, the information theory for structural modeling has
aims similar to such traditional approaches as network and path
analyses, χ² statistics, analysis of variance, and log-linear modeling but
accomplishes them more elegantly, provides greater analytical power
and flexibility, retains more direct touch with the (mathematically)
uncontaminated data, and suggests interpretations closer to social
theory, to communication theory in particular.